BACKGROUND OF THE INVENTION 
Field of the Invention 

[0002] The present invention relates to a method and system for coding low bit 
rate speech for communication systems. More particularly, the present invention 
relates to a method and apparatus for performing prototype waveform magnitude 
quantization using vector quantization. 
Background of the Invention 

[0003] Currently, various speech encoding techniques are used to process speech. 
These techniques do not adequately address the need for a speech encoding technique 
that improves the modeling and quantization of a speech signal, specifically, the 
evolving spectral characteristics of a speech prediction residual signal, which includes 
a prototype waveform (PW) gain vector, a PW magnitude vector, and PW phase 
information. 

[0004] In particular, the prior art techniques include, but are not limited to, the 
following: L.R. Rabiner and R.W. Schafer, "Digital Processing of Speech 
Signals", Prentice-Hall, 1978 (hereinafter known as reference 1); W.B. Kleijn and J. 
Haagen, "Waveform Interpolation for Coding and Synthesis", in Speech Coding and 
Synthesis, edited by W.B. Kleijn and K.K. Paliwal, Elsevier, 1995 (hereinafter known as 
reference 2); F. Itakura, "Line Spectral Representation of Linear Predictive 
Coefficients of Speech Signals", Journal of Acoustical Society of America, vol. 57, 
no. 1, 1975 (hereinafter known as reference 3); P. Kabal and R. P. Ramachandran, 
"The Computation of Line Spectral Frequencies Using Chebyshev Polynomials", 
IEEE Trans. On ASSP, vol. 34, no. 6, pp. 1419-1426, Dec. 1986 (hereinafter known 
as reference 4); W.B. Kleijn, "Encoding Speech Using Prototype Waveforms", IEEE 
Transactions on Speech and Audio Processing, Vol. 1, No. 4, pp. 386-399, 1993 
(hereinafter known as reference 5); and W.B. Kleijn, Y. Shoham, D. Sen and R. 
Hagen, "A Low Complexity Waveform Interpolation Coder", IEEE International 
Conference on Acoustics, Speech and Signal Processing, 1996 (hereinafter known as 
reference 6). All of the references 1 through 6 are herein incorporated in their entirety 
by reference. 

[0005] The prototype waveforms are a sequence of complex Fourier transforms 
evaluated at pitch harmonic frequencies, for pitch period wide segments of the 
residual, at a series of points along the time axis. Thus, the PW sequence contains 
information about the spectral characteristics of the residual signal as well as the 
temporal evolution of these characteristics. A high quality of speech can be achieved 
at low coding rates by efficiently quantizing the important aspects of the PW 
sequence. 

[0006] In PW based coders, the PW is separated into a shape component and a 
level component by computing the RMS (or gain) value of the PW and normalizing 
the PW to a unity RMS value. As the pitch frequency varies, the dimensions of the 
PW vectors also vary, typically in the range of 11-61. Existing VQ techniques, such 
as direct VQ, split VQ and multi-stage VQ, are not well suited for variable dimension 
vectors. Adaptation of these techniques for variable dimension is neither practical 
from an implementation viewpoint nor satisfactory from a performance viewpoint. It is 
not practical since the worst case high dimensionality results in a high computational 
cost and a high storage cost. 

[0007] To address the variable dimensionality problem, prior art in reference 4 
uses analytical functions of a fixed order to approximate the variable dimension 
vectors. The coefficients of the analytical function that provide the best fit to the 
vectors are used to represent the vectors for quantization. This approach suffers from 
three disadvantages. First, a modeling error is added to the quantization error, leading 
to a loss in performance. Second, analytical function approximations of reasonable 
order (on the order of 5-10) deteriorate with increasing frequency. Third, if 
spectrally weighted distortion metrics are used during VQ, the complexity of these 
methods becomes formidable. 

[0008] A PW magnitude vector sequence determines the evolving spectral 
characteristics of a linear predictive (LP) excitation signal and therefore is important 
in signal characterization. Prior art techniques separate the PW sequence into slowly 
evolving (SEW) and rapidly evolving (REW) components. This results in two 
disadvantages. 

[0009] First, the algorithmic delay of the coding scheme in prior art is 
significantly increased as it requires linear low pass and high pass filtering to separate 
the SEW and REW components. This delay can be noticeable in telephone 
conversations. 
[0010] Second, the signal processing in prior art needed for this purpose is 
complicated due to the filtering that is necessary. This increases the computational 
complexity of processing the signal, resulting in higher cost. 

[0011] Additionally, prior art techniques use a non-hierarchical approach in 
quantizing the PW vectors (see references 2-6). This results in lower CODEC 
performance and less robustness to channel errors. 

[0012] Thus, a need exists for a system and method that can accurately recreate 
perceptually important spectral features of the PW magnitude while maintaining 
computational and storage efficiency. Specifically, this permits the evolving spectral 
features of the LP residual signal to be reproduced accurately at the decoder. 

Summary of the Invention 

[0013] An object of the present invention is to provide a system and method for 
accurately representing the spectral features of the LP residual signal and for 
reproducing the spectral features accurately at the decoder. 

[0014] These and other objects are substantially achieved by a system and method 
employing a frequency domain interpolative CODEC system for low bit rate coding 
of speech. The CODEC comprises a linear prediction (LP) front end adapted to 
process an input signal that provides LP parameters which are quantized and encoded 
over predetermined intervals and used to compute an LP residual signal. Also provided 
are an open loop pitch estimator adapted to process the LP residual signal, a pitch 
quantizer, and a pitch interpolator, which together provide a pitch contour within the 
predetermined intervals. Also provided is a signal processor responsive to the LP residual signal and 
the pitch contour and adapted to perform the following: provide a voicing measure, 
where the voicing measure characterizes a degree of voicing of the input speech 
signal and is derived from several input parameters that are correlated to degrees of 
periodicity of the signal over the predetermined intervals; extract a prototype 
waveform (PW) from the LP residual and the open loop pitch contour for a number of 
equal sub-intervals within the predetermined intervals; normalize the PW by a gain 
value of the PW; encode a magnitude of the PW; and directly quantize the PW in a 
magnitude domain without further decomposition of the PW into complex 
components, where the direct quantization is performed by a hierarchical quantization 
method based on a voicing classification using fixed dimension vector quantizers 
(VQ's). 



Brief Description of the Drawings 

The various objects, advantages and novel features of the present invention 
will be more readily understood from the following detailed description when read in 
conjunction with the appended drawings, in which: 

FIGs. 1A and 1B are block diagrams of a Frequency Domain Interpolative 
(FDI) coder/decoder (CODEC) for performing coding and decoding of an input voice 
signal in accordance with an embodiment of the present invention; 

FIG. 2 is a block diagram of frame structures for use with the CODEC of 
FIG. 1 in accordance with an embodiment of the present invention; 

FIG. 3 is a flow chart for a method for updating scale factors to limit 
spectral amplitude gain in performing noise reduction in accordance with an 
embodiment of the present invention; 

FIG. 4 is a flow chart for a method for performing tone detection in 
accordance with an embodiment of the present invention; 

FIG. 5 is a block diagram of stationary and nonstationary components of a 
prototype waveform (PW) in accordance with an embodiment of the present 
invention; 

FIG. 6 is a flow chart for a method for enforcing monotonic measures in 
accordance with an embodiment of the present invention; 

FIG. 7 is a flow chart for a method for computing gain averages in 
accordance with an embodiment of the present invention; 

FIG. 8 is a flow chart for a method for computing the attenuation of a PW 
mean high in the unvoiced high frequency band in accordance with an embodiment of 
the present invention; and 
FIG. 9 is a flow chart for a method for computing the attenuation of a PW 
mean high in the voiced high frequency band in accordance with an embodiment of the 
present invention. 

Throughout the drawing figures, like reference numerals will be 
understood to refer to like parts and components. 

Detailed Description of Preferred Embodiments 

[0015] FIGs. 1A and 1B are block diagrams of a Frequency Domain Interpolative 
(FDI) coder/decoder (CODEC) 100 for performing coding and decoding of an input 
voice signal in accordance with an embodiment of the present invention. The FDI 
CODEC 100 comprises a coder portion 100A which computes prototype waveforms 
(PW) and a decoder portion 100B which reconstructs the PW and speech signal. 
[0016] Specifically, the coder portion 100A illustrates the computation of PW 
from an input speech signal. Voice activity detection (VAD) 102 is performed on the 
input speech to determine whether the input speech is actually speech or noise. The 
VAD 102 provides a VAD flag which indicates whether the input signal was noise or 
speech. The detected signal is then provided to a noise reduction module 104 where 
the noise level for the signal is reduced and provided to a linear predictive (LPC) 
analysis filter module 106. 

[0017] The LPC module 106 provides filtered and residual signals to the 
prototype extraction module 108 as well as LPC parameters to decoder 100B. The 
pitch estimation and interpolation module 110 receives the LPC filtered and residual 
signals from the LPC analysis filter module 106 and pitch contours from the prototype 
extraction module 108 and provides a pitch and a pitch gain. 

[0018] The extracted prototype waveform from prototype extraction module 108 
is provided to compute prototype gain module 112, PW magnitude computation 
and normalization module 114, compute subband nonstationarity measure module 116 
and compute voicing measure module 118. Compute voicing measure (VM) module 
118 also receives the pitch gain from pitch estimation and interpolation module 110 
and computes a voicing measure. 



[0019] The compute prototype gain module 112 computes a prototype gain and 
provides the PW gain value to decoder portion 100B. PW magnitude computation and 
normalization module 114 computes the PW magnitude and normalizes the PW 
magnitude. 

[0020] Compute subband nonstationarity measure module 116 computes a 
subband nonstationarity measure from the extracted prototype waveform. The 
computed subband nonstationarity measure and computed voicing measure are 
provided to a subband nonstationarity measure - Vector quantizer (VQ) module 122 
which processes the received signals. 

[0021] A PW magnitude quantization module 120 receives the computed PW 
magnitude and normalized signal along with the VAD flag indication and quantizes 
the received signal and provides a PW magnitude value to the decoder 100B. 
[0022] The decoder 100B further includes a periodic phase model module 124 
and aperiodic phase model module 126 which receive the PW magnitude value and 
subband nonstationarity measure-voicing measure value from coder 100A and 
compute a periodic phase and an aperiodic phase, respectively, from the received 
signal. The periodic phase model module 124 provides a complex periodic vector 
having a periodic component level and the aperiodic phase model module 126 
provides a complex aperiodic vector having an aperiodic component level to a 
summer which provides a complex PW vector to a normalize PW gain module 128. 
The normalize PW gain module also receives the PW gain value from coder 100A. 
[0023] A pitch interpolation module 130 performs pitch interpolation on a pitch 
period provided by encoder 100A. The normalize PW gain signal and interpolated 
pitch frequency contour signal are provided to an interpolative synthesis module 132 
which performs interpolative synthesis to obtain a reconstructed residual signal from 
the previously mentioned signals. 

[0024] The reconstructed residual signal is provided to an all pole LPC synthesis 
filter module 134 which processes the reconstructed residual signal and provides the 
filtered signal to an adaptive postfilter and tilt correction module 136. Modules 134 
and 136 also receive the VAD flag indication signal and interpolated LPC parameters 
from the encoder 100A. A reconstructed speech signal is provided by the adaptive 
postfilter and tilt correction module 136. 

[0025] Specifically, the FDI codec 100 is based on techniques of linear predictive 
(LP) analysis, robust pitch estimation and frequency domain encoding of the LP 
residual signal. The FDI codec operates on a frame size of preferably 20 ms. Every 20 
ms, the speech encoder 100A produces 80 bits representing compressed speech. The 
speech decoder 100B receives the 80 compressed speech bits and reconstructs a 20 ms 
frame of speech signal. The encoder 100A preferably uses a look ahead buffer of at 
least 20 ms, resulting in an algorithmic delay comprising buffering delay and look 
ahead delay of 40 ms. 

[0026] The speech encoder 100A is equipped with a built-in voice activity 
detector (VAD) 102 and can operate in continuous transmission (CTX) mode or in 
discontinuous transmission (DTX) mode. In the DTX mode, comfort noise 
information (CNI) is encoded as part of the compressed bit stream during silence 
intervals. At the decoder 100B, the CNI packets are used by a comfort noise 
generation (CNG) algorithm to regenerate a close approximation of the ambient noise. 
The VAD information is also used by an integrated front end noise reduction scheme 
that can provide varying degrees of background noise level attenuation and speech 
signal enhancement. 

[0027] A single parity check bit is preferably included in the 80 compressed 
speech bits of each frame of the input speech signal to detect channel errors in 
perceptually important compressed speech bits. This enables the codec 100 to operate 
satisfactorily in links with a random bit error rate up to about $10^{-3}$. In addition, the 
decoder 100B uses bad frame concealment and recovery techniques to extend signal 
processing operations during frame erasures. 

[0028] In addition to the speech coding functions, the codec 100 
also has the ability to transparently pass dual tone multifrequency (DTMF) and 
signaling tones. 

[0029] As discussed above, the FDI codec 100 uses the linear predictive analysis 
technique to model the short term Fourier spectral envelope of the input speech signal. 
Subsequently, a pitch frequency estimate is used to perform a frequency domain 
prototype waveform analysis of the LP residual signal. Specifically, the PW analysis 
provides a characterization of the harmonic or fine structure of the speech spectrum. 
More specifically, the PW magnitude spectrum provides the correction necessary to 
refine the short term LP spectral estimate to obtain a more accurate fit to the speech 
spectrum at the pitch harmonic frequencies. Information about the phase of the signal 
is implicitly represented by the degree of periodicity of the signal measured across a 
set of subbands. 

[0030] In a preferred embodiment of the present invention, the input speech signal 
is processed in consecutive non-overlapping frames of 20 ms duration, which 
corresponds to 160 samples at the sampling frequency of 8000 samples/sec. The 
encoder 100A parameters are quantized and transmitted once for each 20 ms frame. A 
look-ahead of 20 ms is used for voice activity detection, noise reduction, LP analysis 
and pitch estimation. This results in an algorithmic delay, defined as the sum of the 
buffering delay and the look-ahead delay, of 40 ms. 

[0031] Referring to FIG. 2 which illustrates the samples used for various 
functions at the encoder 100A, an estimated size of buffered samples for various 
frames is shown. For example, a VAD window 210 uses buffered samples from about 
160 to 400 samples. A noise reduction window 220 uses about the same number of 
samples. Pitch estimation windows 230_1 through 230_5 each use about 240 samples. The 
LP analysis window processes the signal in about 80 to 400 samples. A current frame 
being encoded is processed between samples 80 and 240. New input speech data 260 
and look-ahead 280 are processed from about sample 240 to sample 400, while past data is 
processed from sample zero to sample 80. For the purposes of excitation modeling, each 
frame is further divided into 8 subframes preferably of duration 2.5 ms or 20 samples. 
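
The framing figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch; the constant names are illustrative, and the bit rate is derived here from the stated 80 bits per 20 ms frame rather than quoted from the text.

```python
# Framing parameters as stated in the description.
SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
LOOKAHEAD_MS = 20
BITS_PER_FRAME = 80
SUBFRAMES_PER_FRAME = 8

# Derived quantities.
frame_samples = SAMPLE_RATE_HZ * FRAME_MS // 1000        # samples per frame
subframe_samples = frame_samples // SUBFRAMES_PER_FRAME  # samples per subframe
subframe_ms = FRAME_MS / SUBFRAMES_PER_FRAME             # subframe duration
bit_rate_bps = BITS_PER_FRAME * 1000 // FRAME_MS         # compressed bit rate
algorithmic_delay_ms = FRAME_MS + LOOKAHEAD_MS           # buffering + look-ahead
```

Under these figures a frame holds 160 samples, a subframe holds 20 samples (2.5 ms), the delay is 40 ms, and the compressed stream runs at 4000 bits per second.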
[0032] The invention will now be discussed in terms of front end processing, 
specifically input preprocessing. The new input speech samples are first scaled down 
by preferably 0.5 to prevent overflow in fixed point implementation of the coder 
100A. In another embodiment of the present invention, the scaled speech samples can 
be high-pass filtered using an infinite impulse response (IIR) filter with a cut-off 
frequency of 60 Hz, to eliminate undesired low frequency components. The transfer 
function of the 2nd order high pass filter is given by 

$$H_{hpf}(z) = \frac{0.939819335 - 1.879638672\,z^{-1} + 0.939819335\,z^{-2}}{1 - 1.933195469\,z^{-1} + 0.935913085\,z^{-2}} \tag{1}$$

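
The preprocessing step can be sketched as a direct-form difference equation. This is an illustrative floating-point sketch only, not the fixed point implementation the text refers to; the function name `preprocess` and its `scale` parameter are assumptions introduced here, while the coefficients are those of equation (1).

```python
def preprocess(samples, scale=0.5):
    """Scale the input and apply the 2nd order high-pass filter of equation (1).

    Floating-point sketch; the function name and the `scale` argument are
    illustrative and not taken from the source.
    """
    b0, b1, b2 = 0.939819335, -1.879638672, 0.939819335  # numerator
    a1, a2 = -1.933195469, 0.935913085                   # denominator (a0 = 1)
    x1 = x2 = y1 = y2 = 0.0  # filter state, initially at rest
    out = []
    for s in samples:
        x0 = scale * s
        # y(n) = b0 x(n) + b1 x(n-1) + b2 x(n-2) - a1 y(n-1) - a2 y(n-2)
        y0 = b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out
```

Feeding a constant (DC) input through `preprocess` yields an output that decays toward zero, which is the behavior expected of a high-pass filter that removes low frequency components.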

[0033] In terms of the VAD module 102, the preprocessed signal is analyzed to 
detect the presence of speech activity. This comprises the following operations: 
scaling the signal via an automatic gain control (AGC) mechanism to improve VAD 
performance for low level signals; windowing the AGC scaled speech and computing 
a set of autocorrelation lags; performing a 10th order 
autocorrelation LP analysis of the AGC scaled speech to determine a set of LP 
parameters which are used during pitch estimation; performing a preliminary pitch 
estimation based on the pitch candidates for the look-ahead part of the buffer; and 
performing voice activity detection based on the autocorrelation lags, the pitch 
estimate and the tone detection flag, which is generated by examining the distance 
between adjacent line spectral frequencies (LSFs), as described in greater 
detail below with respect to conversion to line spectral frequencies. 
[0034] This series of operations produces a VAD_FLAG and a VID_FLAG that 
have the following values depending on the detected voice activity: 

VAD_FLAG = 1 if voice activity is present, 
           0 if voice activity is absent. 
VID_FLAG = [...] if voice activity is present, 
           [...] if voice activity is absent. 

It should be noted that the VAD_FLAG and the VID_FLAG represent the voice 
activity status of the look-ahead part of the buffer. A delayed VAD flag, 
VAD_FLAG_DL, is also maintained to reflect the voice activity status of the current 
frame. In a presentation given during an IEEE speech and audio processing workshop 
in Finland during 1999, the entire contents of the documentation being incorporated 
by reference herein, the presenters F. Basbug, S. Nandkumar and K. Swaminathan 
described an AGC front-end for the VAD, which itself is a variation of the voice 
activity detection algorithms used in cellular standards "TDMA cellular/PCS Radio 
Interface - Minimum Objective Standards for IS-136 B, DTX/CNG Voice Activity 
Detection", which is also incorporated by reference in its entirety. A by-product of 
the AGC front-end is the global signal-to-noise ratio, which is used to control the 
degree of noise reduction. 

[0035] The VAD flag is encoded explicitly only for unvoiced frames as 
indicated by the voicing measure flag. Voiced frames are assumed to be active 
speech. In the present embodiment of the invention, the VAD flag is not coded 
explicitly. The decoder sets the VAD flag to a one for all voiced frames. However, it 
will be appreciated by those skilled in the art that the VAD flag can be coded 
explicitly without departing from the scope of the present invention. 
[0036] Noise reduction module 104 provides noise reduction to the voice activity 
detected speech signal. Specifically, the preprocessed speech signal is processed by a 
noise reduction algorithm to produce a noise reduced speech signal. The following is 
a series of steps comprising the noise reduction algorithm: A trapezoidal windowing 
and the computing of the complex discrete Fourier transform (DFT) of the signal is 
performed. FIG. 2 depicts the part of the buffer that undergoes the DFT operation. A 
256-point DFT (240 windowed samples + 16 padded zeros) is used. The magnitude of 
the DFT is smoothed along the frequency axis across a variable window whose width 
is about 187.5 Hz in the first 1 kHz, about 250 Hz in the range of 1-2 kHz, and about 
500 Hz in the range of 2-4 kHz. These values reflect a compromise between 
the conflicting objectives of preserving the formant structure and having sufficient 
smoothness of the speech signal. 

[0037] If the vvad_flag, which is the VAD output prior to hangover, is a one, 
which indicates voice activity, then the smoothed magnitude square of the DFT is 
taken to be the smoothed power spectrum of noisy speech $S(k)$. However, if the 
vvad_flag is a zero, indicating voice inactivity, the smoothed DFT power spectrum is 
then used to update a recursive estimate of the average noise power spectrum $N_{av}(k)$ 
as follows: 

$$N_{av}(k) = 0.9\,N_{av}(k) + 0.1\,S(k) \quad \text{if } VAD\_FLAG = 0 \tag{2}$$

A spectral gain function is then computed based on the average noise power spectrum 
and the smoothed power spectrum of the noisy speech. The gain function $G_{nr}(k)$ takes 
the following form: 

$$G_{nr}(k) = \frac{S(k)}{F_{nr}\,N_{av}(k) + S(k)} \tag{3}$$

Here, the factor $F_{nr}$ depends on the global signal-to-noise ratio 
$SNR_{global}$ that is generated by the AGC front-end for the VAD. The factor $F_{nr}$ can 
be expressed as an empirically derived piecewise linear function of $SNR_{global}$ that 
is monotonically non-decreasing. The gain function is close to unity when the 
smoothed power spectrum $S(k)$ is much larger than the average noise power 
spectrum $N_{av}(k)$. Conversely, the gain function becomes small when $S(k)$ is 
comparable to or much smaller than $N_{av}(k)$. The factor $F_{nr}$ controls the degree of 
noise reduction by providing for a higher degree of noise reduction when the 
global signal-to-noise ratio is high (i.e., the risk of spectral distortion is low since 
the VAD and the average noise estimate are fairly accurate). Conversely, the factor 
restricts the amount of noise reduction when the global signal-to-noise ratio is low 
(i.e., the risk of spectral distortion is high due to increased VAD 
inaccuracies and a less accurate average noise power spectral estimate). 
[0038] The spectral amplitude gain function is further clamped to a floor which is 
a monotonically non-increasing function of the global signal-to-noise ratio. This kind 
of clamping reduces the fluctuations in the residual background noise after noise 
reduction, making the speech sound smoother. The clamping action is expressed as: 

$$G'_{nr}(k) = \mathrm{MAX}\left(G_{nr}(k),\ T_{global}(SNR_{global})\right) \tag{4}$$

Thus, at high global signal-to-noise ratios, the spectral gain functions will be clamped 
to a lower floor since there is less risk of spectral distortion due to inaccuracies in the 
VAD or the average noise power spectral estimate $N_{av}(k)$. But at lower global 
signal-to-noise ratios, the risks of spectral distortion outweigh the benefits 
of reduced noise and therefore a higher floor would be appropriate. 
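
The noise estimate update, the spectral gain and the floor clamp can be sketched per frequency bin as follows. This is a sketch under stated assumptions: the gain is taken to have the form $S(k)/(F_{nr}N_{av}(k)+S(k))$, which matches the behavior described in the text but whose numerator is not legible in the source; the values of `f_nr` and `gain_floor` are passed in as plain numbers because the piecewise linear functions of the global signal-to-noise ratio that produce them are not given. The function names are illustrative.

```python
def update_noise_estimate(noise_avg, smoothed_power, vad_flag):
    """Recursive noise power update of equation (2), applied only when
    no voice activity is flagged."""
    if vad_flag == 0:
        return [0.9 * n + 0.1 * s for n, s in zip(noise_avg, smoothed_power)]
    return list(noise_avg)


def clamped_spectral_gain(smoothed_power, noise_avg, f_nr, gain_floor):
    """Spectral gain per bin followed by the floor clamp of equation (4).

    Assumed gain form: S / (f_nr * N + S). Both f_nr and gain_floor would
    normally be derived from the global SNR; here they are caller-supplied.
    """
    gains = []
    for s, n in zip(smoothed_power, noise_avg):
        denom = f_nr * n + s
        gain = s / denom if denom > 0.0 else 1.0  # guard against empty bins
        gains.append(max(gain, gain_floor))
    return gains
```

A bin whose power is far above the noise estimate keeps a gain near unity, while a bin at the noise level is attenuated down to, but not below, the floor.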
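[0039] In order to reduce the frame-to-frame variation in the spectral amplitude 
gain function, a gain limiting device is used which limits the gain to a range 
that depends on the previous frame's gain for the same frequency. The limiting action 
can be expressed as follows: 

$$G^{new}_{nr}(k) = \mathrm{MIN}\left(\mathrm{MAX}\left(S^{L}_{nr}\,G^{old}_{nr}(k),\ G'_{nr}(k)\right),\ S^{H}_{nr}\,G^{old}_{nr}(k)\right) \tag{5}$$

Here $G^{old}_{nr}(k)$ denotes the previous frame's gain at frequency $k$ and $G'_{nr}(k)$ 
the floor-clamped gain of equation (4). The scale factors $S^{L}_{nr}$ and $S^{H}_{nr}$ are 
updated using a state machine whose actions depend 
on whether the frame is active, inactive or transient. 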

[0040] FIG. 3 depicts a flowchart 300 which performs scale factor updates in 
accordance with an embodiment of the present invention. The process 300 occurs in 
noise reduction module 104 and is initiated at step 302 where input values 
VAD_FLAG and scale factors are received. The method 300 then proceeds to step 
304 where a determination is made as to whether the VAD_FLAG is zero, which 
indicates voice activity is absent. If the determination is affirmative the method 300 
proceeds to step 306 where the scale factors are adjusted to be closer to unity. The 
method 300 then proceeds to step 308. 

[0041] At step 308 a determination is made as to whether the VAD_FLAG was 
zero for the last two frames. If the determination is affirmative the method proceeds to 
step 310 where the scale factors are limited to be very close to unity. However, if the 
determination was negative, the method 300 then proceeds to step 312 where the scale 
factors are limited to be away from unity. 

[0042] If the determination at step 304 was negative, the method 300 then 
proceeds to step 314 where the scale factors are adjusted to be away from unity. The 
method 300 then proceeds to step 316 where the scale factors are limited to be far 
away from unity. 

[0043] The steps 310, 312 and 316 proceed to step 318 where the updated scale 
factors are outputted. 
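
The flow of FIG. 3 and the gain limiting it supports can be sketched as below. The source gives no numeric values, so the step size `STEP`, the `LIMITS` table and the reading of "limited to be close to / away from unity" as bounds on a lower and an upper scale factor are all hypothetical choices made for illustration, as are the function names.

```python
# Hypothetical bounds on the (lower, upper) scale factors for each frame state.
LIMITS = {
    "inactive": (0.95, 1.05),   # very close to unity (step 310)
    "transient": (0.80, 1.25),  # away from unity (step 312)
    "active": (0.50, 2.00),     # far away from unity (step 316)
}
STEP = 0.05  # hypothetical adjustment per frame


def update_scale_factors(s_lo, s_hi, vad_flag, last_two_flags):
    """One pass through the FIG. 3 flow; assumes s_lo <= 1 <= s_hi on entry."""
    if vad_flag == 0:
        # Step 306: adjust both factors toward unity.
        s_lo = min(1.0, s_lo + STEP)
        s_hi = max(1.0, s_hi - STEP)
        # Step 308: was VAD_FLAG zero for the last two frames?
        state = "inactive" if all(f == 0 for f in last_two_flags) else "transient"
    else:
        # Step 314: adjust both factors away from unity.
        s_lo -= STEP
        s_hi += STEP
        state = "active"
    lo_bound, hi_bound = LIMITS[state]
    return max(s_lo, lo_bound), min(s_hi, hi_bound)  # step 318


def limit_gain(gain, prev_gain, s_lo, s_hi):
    """Keep the new gain within [s_lo * prev_gain, s_hi * prev_gain]."""
    return min(max(gain, s_lo * prev_gain), s_hi * prev_gain)
```

With these illustrative numbers, a run of inactive frames narrows the allowed frame-to-frame gain change to within 5 percent, while active speech frames are allowed much larger swings.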

[0044] The final spectral gain function $G^{new}_{nr}(k)$ is multiplied with the complex 
DFT of the preprocessed speech, attenuating the noise dominant frequencies and 
preserving signal dominant frequencies. An overlap-and-add inverse DFT is then 
performed on the spectral gain scaled DFT to compute a noise reduced speech signal 
over the interval of the noise reduction window. 

[0045] Since the noise reduction is carried out in the frequency domain, the 
availability of the complex DFT of the preprocessed speech is taken advantage of in 
order to carry out DTMF and Signaling tone detection. These detection schemes are 
based on examination of the strength of the power spectra at the tone frequencies, the 
out-of-band energy, the signal strength, and validity of the bit duration pattern. It 
should be noted that the incremental cost of having such detection schemes to 
facilitate transparent transmission of these signals is negligible since the power 
spectrum of the preprocessed speech is already available. 

[0046] An embodiment of the invention will now be described in terms of LPC 
analysis filtering module 106. The noise reduced speech signal is subjected to a 10th 
order autocorrelation method of LP analysis, where $\{s_{nr}(n),\ 0 \le n < 400\}$ denotes the 
noise reduced speech buffer, $\{s_{nr}(n),\ 80 \le n < 240\}$ is the current frame being 
encoded and $\{s_{nr}(n),\ 240 \le n < 320\}$ is the look-ahead buffer 280 as shown in FIG. 2. 
In the LP analysis of speech, the magnitude spectrum of short segments of speech is 
modeled by the magnitude frequency response of an all-pole minimum phase filter, 
whose transfer function is represented by 

$$H_{lp}(z) = \frac{1}{\sum_{m=0}^{M} a_m z^{-m}} \tag{6}$$

Here, $\{a_m,\ 0 \le m \le M\}$ are the LP parameters for the current frame and $M = 10$ is the 
LP order. LP analysis is performed using the autocorrelation method with a modified 
Hanning window of size 40 ms (320 samples) which includes the 20 ms current frame 
and the 20 ms lookahead frame as shown in FIG. 2. 

[0047] The noise reduced speech signal over the LP analysis window 
$\{s_{nr}(n),\ 80 \le n < 400\}$ is windowed using a modified Hanning window function 
$\{w_{lp}(n),\ 0 \le n < 320\}$ defined as follows: 

$$w_{lp}(n) = \begin{cases} 0.5 - 0.5\cos\left(\dfrac{2\pi n}{319}\right), & 0 \le n < 240, \\ \left(0.5 - 0.5\cos\left(\dfrac{2\pi n}{319}\right)\right)\cos^{2}\left(\dfrac{2\pi (n-240)}{320}\right), & 240 \le n < 320. \end{cases} \tag{7}$$

The windowed speech buffer is computed by multiplying the noise reduced speech 
buffer with the window function as follows: 

$$s_w(n) = s_{nr}(80 + n)\,w_{lp}(n), \quad 0 \le n < 320. \tag{8}$$

Normalized autocorrelation lags are computed from the windowed speech by 

$$r_{lp}(m) = \frac{\sum_{n=0}^{319-m} s_w(n)\,s_w(n+m)}{\sum_{n=0}^{319} s_w^{2}(n)}, \quad 0 \le m \le 10. \tag{9}$$


The autocorrelation lags are windowed by a binomial window with a bandwidth 
expansion of 60 Hz. The binomial window is given by the following recursive rule: 

$$l_w(m) = \begin{cases} 1, & m = 0, \\ l_w(m-1)\,\dfrac{4995 - m}{4994 + m}, & 1 \le m \le 10. \end{cases} \tag{10}$$

Lag windowing is performed by multiplying the autocorrelation lags by the binomial 
window: 

$$r_{lpw}(m) = r_{lp}(m)\,l_w(m), \quad 1 \le m \le 10. \tag{11}$$

The zeroth windowed lag $r_{lpw}(0)$ is obtained by multiplying by a white noise 
correction factor of about 1.0001, which is equivalent to adding a noise floor at -40 
dB: 

$$r_{lpw}(0) = 1.0001\,r_{lp}(0). \tag{12}$$
[0048] Lag windowing and white noise correction are techniques used to 
address problems that arise in the case of periodic or nearly periodic signals. For such 
signals, the all-pole LP filter is marginally stable, with its poles very close to the unit 
circle. It is necessary to prevent such a condition to ensure that the LP quantization 
and signal synthesis at the decoder 100B can be performed satisfactorily. 
[0049] The LP parameters that define a minimum phase spectral model to the 
short term spectrum of the current frame are determined by applying Levinson-Durbin 
recursions to the windowed autocorrelation lags $\{r_{lpw}(m),\ 0 \le m \le 10\}$. The resulting 
10th order LP parameters for the current frame are $\{a'_m,\ 0 \le m \le 10\}$, with $a'_0 = 1$. Since 
the LP analysis window is centered around the sample index of about 240 in the 
buffer, the LP parameters represent the spectral characteristics of the signal in the 
vicinity of this point. 
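
The LP analysis chain described above (modified Hanning window, normalized autocorrelation, binomial lag window, white noise correction and Levinson-Durbin recursion) can be sketched in plain Python. This is an illustrative floating-point sketch: the function names are introduced here, the window uses the $2\pi n/319$ argument of a standard Hanning window of length 320, and the input `frame` is assumed to hold the 320 noise reduced samples at buffer positions 80 through 399.

```python
import math


def lp_window():
    """Modified Hanning window of length 320 with a cos^2 taper on the tail."""
    w = []
    for n in range(320):
        v = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / 319.0)
        if n >= 240:
            v *= math.cos(2.0 * math.pi * (n - 240) / 320.0) ** 2
        w.append(v)
    return w


def binomial_lag_window(order=10):
    """Recursive binomial lag window: l(m) = l(m-1) * (4995 - m) / (4994 + m)."""
    lw = [1.0]
    for m in range(1, order + 1):
        lw.append(lw[-1] * (4995 - m) / (4994 + m))
    return lw


def windowed_autocorr(frame, order=10):
    """Normalized, lag-windowed autocorrelation with white noise correction."""
    sw = [x * wn for x, wn in zip(frame, lp_window())]
    energy = sum(v * v for v in sw)
    if energy == 0.0:
        raise ValueError("silent frame: autocorrelation is undefined")
    r = [sum(sw[n] * sw[n + m] for n in range(len(sw) - m)) / energy
         for m in range(order + 1)]
    lw = binomial_lag_window(order)
    rw = [r[m] * lw[m] for m in range(order + 1)]
    rw[0] = 1.0001 * r[0]  # white noise correction (noise floor at -40 dB)
    return rw


def levinson_durbin(r, order=10):
    """Solve for a[0..order] with a[0] = 1 so that 1/A(z) models the spectrum."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a
```

A typical call would be `levinson_durbin(windowed_autocorr(frame))`, which returns eleven coefficients with the first equal to one.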

[0050] During highly periodic signals, the spectral fit provided by the LP model 
tends to be excessively peaky in the low formant regions, resulting in audible 
distortions. To overcome this problem, a bandwidth broadening scheme has been 
employed in this embodiment of the present invention, where the formant bandwidth 
of the model is broadened adaptively, depending on the degree of peakiness of the 
spectral model. The LP spectrum is given by 



$$S(e^{j\omega}) = \frac{1}{\sum_{m=0}^{10} a'_m e^{-j\omega m}} \tag{13}$$

Let $\omega_m$ denote the pitch frequency estimate of the $m$th subframe ($1 \le m \le 8$) of the 
current frame in radians/sample. Given this pitch frequency, the index of the highest 
frequency pitch harmonic that falls within the frequency band of the signal (0-4000 
Hz or 0-$\pi$ radians) for the $m$th subframe is given by 

$$K_m = \left\lfloor \frac{\pi}{\omega_m} \right\rfloor, \quad 1 \le m \le 8, \tag{14}$$

where $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$. The magnitude of the 
LPC spectrum is evaluated at the pitch harmonics by 

$$|S(k)| = \left|S(e^{j\omega_8 k})\right| = \frac{1}{\left|\sum_{m=0}^{10} a'_m e^{-j\omega_8 k m}\right|}, \quad 0 \le k \le K_8. \tag{15}$$

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It should be noted that co 8 corresponds to the 8 th subframe has been used here since 
the LP parameters have been evaluated for a window centered around a sample of 
about 240 as shown in FIG. 2. A logarithmic peak-to-average ratio of the harmonic 
spectral magnitudes is computed as 



$$ PAR = 10 \log_{10} \left( \frac{\max_{1 \le k \le K_8} |S(k)|}{\frac{1}{K_8} \sum_{k=1}^{K_8} |S(k)|} \right). \tag{16} $$

The peak-to-average ratio ranges from 0 dB (for flat spectra) to values exceeding 20 
dB (for highly peaky spectra). The expansion in formant bandwidth (expressed in Hz) 
is then determined based on the log peak-to-average ratio according to a piecewise 
linear characteristic: 

$$ dw_{bw} = \begin{cases} 10 + 2\,PAR, & PAR \le 5, \\ 20 + 12\,(PAR - 5), & 5 < PAR \le 10, \\ 80 + 4\,(PAR - 10), & 10 < PAR \le 20, \\ 120, & PAR > 20. \end{cases} \tag{17} $$

The expansion in bandwidth ranges from a minimum of about 10 Hz for flat spectra to 
a maximum of about 120 Hz for highly peaky spectra. Thus, the bandwidth expansion 
is adapted to the degree of peakiness of the spectra. The above piecewise linear 
characteristic has been experimentally optimized to provide the right degree of 
bandwidth expansion for a range of spectral characteristics. A bandwidth expansion 
factor $\alpha_{bw}$ to apply this bandwidth expansion to the LP spectrum is obtained by 

$$ \alpha_{bw} = e^{-\pi\, dw_{bw} / 8000}. \tag{18} $$

The LP parameters representing the bandwidth expanded LP spectrum are determined 
by 

$$ a_m = a'_m\, \alpha_{bw}^{m}, \quad 0 \le m \le 10. \tag{19} $$

[0051] The bandwidth expanded LP filter coefficients are converted to line 
spectral frequencies (LSFs) for quantization and interpolation purposes, as 
described in "Line Spectral Representation of Linear Predictive Coefficients of 
Speech Signals," Journal of Acoustical Society of America, vol. 57, no. 1, 1975 by F. 
Itakura, which is incorporated by reference in its entirety. An efficient approach to 
computing LSFs from LP parameters using Chebyshev polynomials is described in 
"The Computation of Line Spectral Frequencies Using Chebyshev Polynomials," 
IEEE Trans. On ASSP, vol. 34, no. 6, pages 1419-1426, Dec. 1986 by P. Kabal and 
R.P. Ramachandran, which is herein incorporated by reference in its entirety. The 
resulting LSFs for the current frame are denoted by $\{\lambda(m), 0 \le m < 10\}$. 
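
The adaptive bandwidth expansion of equations (15) to (19) can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, the average in the peak-to-average ratio is assumed to be the arithmetic mean of the harmonic magnitudes for k = 1..K_8, and the 8000 Hz sampling rate is taken from equation (18).

```python
import cmath
import math

def harmonic_magnitudes(a, omega8):
    """|S(k)| for k = 0..K8, evaluated at multiples of the pitch frequency omega8."""
    k_max = int(math.pi // omega8)
    mags = []
    for k in range(k_max + 1):
        denom = sum(c * cmath.exp(-1j * omega8 * k * m) for m, c in enumerate(a))
        mags.append(1.0 / abs(denom))
    return mags

def peak_to_average_db(mags):
    """Log peak-to-average ratio over harmonics k >= 1 (averaging form is an assumption)."""
    h = mags[1:]
    return 10.0 * math.log10(max(h) / (sum(h) / len(h)))

def bandwidth_expansion_hz(par):
    """Piecewise linear map from PAR (dB) to formant bandwidth expansion (Hz)."""
    if par <= 5:
        return 10 + 2 * par
    if par <= 10:
        return 20 + 12 * (par - 5)
    if par <= 20:
        return 80 + 4 * (par - 10)
    return 120

def expand_bandwidth(a, dw_hz, fs=8000.0):
    """Scale the m-th LP coefficient by alpha**m, alpha = exp(-pi * dw / fs)."""
    alpha = math.exp(-math.pi * dw_hz / fs)
    return [c * alpha ** m for m, c in enumerate(a)]
```

The characteristic is continuous at its breakpoints (20 Hz at 5 dB, 80 Hz at 10 dB, 120 Hz at 20 dB), which is a quick sanity check on the four branches.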
[0052] The LSF domain also lends itself to detection of highly periodic or 
resonant inputs. For such signals, the LSFs located near the signal frequency have 
very small separations. If the minimum difference between adjacent LSF values falls 
below a threshold for a number of consecutive frames, it is highly probable that the 
input signal is a tone. 

[0053] FIG. 4 describes a method 400 for tone detection in accordance with an 
embodiment of the present invention. The method 400 occurs in LPC analysis 
filtering module 106 and is initiated at step 402 where a tone counter is set 
illustratively for a maximum of 16. The method 400 then proceeds to step 404 where 
a determination is made as to whether the minimum difference between adjacent LSF 
values falls below a threshold of, for example, 0.008. If the determination is answered 
negatively, the method 400 proceeds to step 406 where the tone counter registers that 
the LSF difference is above the threshold. 

[0054] If the determination at step 404 is answered affirmatively, the method 400 
proceeds to step 412 where the tone counter registers that the LSF difference is below 
the threshold and the counter is incremented by one. Steps 406 and 412 both proceed 
to step 408. 

[0055] At step 408 a determination is made as to whether the tone counter is at its 
maximum value. If the determination at step 408 is answered negatively, the method 
400 proceeds to step 410 where a tone flag equals false indication is provided. If the 
determination at step 408 is answered affirmatively, the method 400 proceeds to step 
414 where a tone flag equals true indication is provided. 

[0056] Steps 410 and 414 proceed to step 416 where the method 400 
continues checking for tones. Specifically, method 400 provides a tone flag indication 
which is a one if a tone has been detected and a zero otherwise. This flag is also used 
in voice activity detection. 
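
A minimal sketch of the tone detection logic of method 400 is given below. The class name `ToneDetector` is hypothetical, and resetting the counter to zero when the LSF separation is above the threshold is an assumption: the text states only that the condition must hold for a number of consecutive frames.

```python
class ToneDetector:
    """Per-frame tone flag based on the minimum adjacent LSF separation."""

    def __init__(self, max_count=16, threshold=0.008):
        self.max_count = max_count
        self.threshold = threshold
        self.count = 0

    def update(self, lsfs):
        """Process one frame of LSFs (ascending order); return the tone flag."""
        min_gap = min(b - a for a, b in zip(lsfs, lsfs[1:]))
        if min_gap < self.threshold:
            # Step 412: separation below threshold, count this frame.
            self.count = min(self.count + 1, self.max_count)
        else:
            # Step 406: assumed here to restart the run of consecutive frames.
            self.count = 0
        # Steps 408 to 414: flag is true only when the counter is at its maximum.
        return self.count >= self.max_count
```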

[0057] The invention will now be described in reference to the pitch estimation 
and interpolation module 110. Pitch estimation is performed based on an 
autocorrelation analysis of a spectrally flattened low pass filtered speech signal. 
Spectral flattening is accomplished by filtering the AGC scaled speech signal using a 
pole-zero filter, constructed using the LP parameters of the AGC scaled speech signal. If 
$\{a_m^{agc}, 0 \le m \le 10\}$ are the LP parameters of the AGC scaled speech signal, the pole-zero 
filter is given by 

$$ H_{sf}(z) = \frac{\sum_{m=0}^{10} a_m^{agc} z^{-m}}{\sum_{m=0}^{10} a_m^{agc} (0.8)^m z^{-m}}. \tag{20} $$

The spectrally flattened signal is low-pass filtered by a 2nd order IIR filter with a 3 dB 
cutoff frequency of 1000 Hz. The transfer function of this filter is 

$$ H_{lpf1}(z) = \frac{0.06745527 + 0.134910548 z^{-1} + 0.06745527 z^{-2}}{1 - 1.14298050 z^{-1} + 0.41280159 z^{-2}}. \tag{21} $$

[0058] The resulting signal is subjected to an autocorrelation analysis in two 
stages. In the first stage, a set of four raw normalized autocorrelation functions (ACF) 
are computed over the current frame. The windows for the raw ACFs are staggered by 
40 samples as shown in FIG. 2. The raw ACF for the i-th window is computed by 

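$$ r_{raw}(i,l) = \frac{\sum_{n=40(i-1)}^{40(i-1)+239-l} x(n)\, x(n+l)}{\sqrt{\sum_{n=40(i-1)}^{40(i-1)+239-l} x^2(n) \sum_{n=40(i-1)}^{40(i-1)+239-l} x^2(n+l)}}, \quad 15 \le l \le 125,\ 2 \le i \le 5, \tag{22} $$

where $x(n)$ denotes the spectrally flattened, low-pass filtered signal. 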

[0059] In each frame, raw ACFs corresponding to windows 2, 3, 4 and 5 as shown 
in FIG. 2 are computed. In addition, a raw ACF for window 1 is preserved from the 
previous frame. For each raw ACF, the location of the peak within the lag range 
$20 \le l \le 120$ is determined. 

[0060] In the second stage, each raw ACF is reinforced by the preceding and the 
succeeding raw ACF, resulting in a composite ACF. For each lag $l$ in the raw ACF in 
the range $20 \le l \le 120$, peak values within a small range of lags 
$[\,l - w_c(l),\ l + w_c(l)\,]$ are determined in the preceding and the succeeding raw 
ACFs. These peak values reinforce the raw ACF at each lag $l$, via a weighted 
combination: 

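$$ r_{comp}(i,l) = r_{raw}(i,l) + \frac{w_c(l) + 1 - 0.1\,|m_{peak}(l) - l|}{w_c(l) + 1} \max_{l - w_c(l) \le m \le l + w_c(l)} r_{raw}(i-1,m) + \frac{w_c(l) + 1 - 0.1\,|n_{peak}(l) - l|}{w_c(l) + 1} \max_{l - w_c(l) \le n \le l + w_c(l)} r_{raw}(i+1,n), \quad 20 \le l \le 120,\ 2 \le i \le 5. \tag{23} $$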



Here, $w_c(l)$ determines the window length based on the lag index $l$: 

$$ w_c(l) = \begin{cases} 2, & l < 30, \\ \lfloor 0.05\,l + 0.5 \rfloor, & 30 \le l \le 70, \\ 4, & l > 70. \end{cases} \tag{24} $$

[0061] Also, $m_{peak}(l)$ and $n_{peak}(l)$ are the locations of the peaks within the 
window. The weighting attached to the peak values from the adjacent ACFs ensures 
that the reinforcement diminishes with increasing difference between the peak 
location and the lag / . The reinforcement boosts a peak value if peaks also occur at 
nearby lags in the adjacent raw ACFs. This increases the probability that such a peak 
location is selected as the pitch period. ACF peak locations due to an underlying 
periodicity do not change significantly across a frame. Consequently, such peaks are 
strengthened by the above process. On the other hand, spurious peaks are unlikely to 
have such a property and consequently are diminished. This improves the accuracy of 
pitch estimation. 
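
The reinforcement step can be sketched as follows. The window length follows the piecewise rule for w_c(l); the weighting of the neighbouring peaks, (w_c + 1 - 0.1 * distance) / (w_c + 1), is an assumed reading of the weighted combination, and the function names and the list-indexed-by-lag layout are hypothetical.

```python
def window_half_width(lag):
    """w_c(l): 2 below lag 30, floor(0.05*l + 0.5) for 30..70, 4 above 70."""
    if lag < 30:
        return 2
    if lag > 70:
        return 4
    # floor(0.05*l + 0.5) written in integer arithmetic to avoid rounding issues.
    return (lag + 10) // 20

def reinforce(prev_acf, cur_acf, next_acf, lag):
    """Composite ACF value at `lag`; each ACF is a list indexed directly by lag."""
    w = window_half_width(lag)
    total = cur_acf[lag]
    for neighbour in (prev_acf, next_acf):
        # Strongest peak of the adjacent ACF within [lag - w, lag + w].
        peak_lag = max(range(lag - w, lag + w + 1), key=lambda m: neighbour[m])
        # Assumed weighting: shrinks as the peak moves away from `lag`.
        weight = (w + 1 - 0.1 * abs(peak_lag - lag)) / (w + 1)
        total += weight * neighbour[peak_lag]
    return total
```

With raw ACFs available for lags 15 to 125, every window for lags 20 to 120 stays inside the computed range, since w_c(l) never exceeds 4.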

[0062] Within each composite ACF the locations of the two strongest peaks are 
obtained. These locations are the candidate pitch lags for the corresponding pitch 
window, and take values in the range 20 - 120, inclusive. Together with 
the two peaks from the last composite ACF of the previous frame, i.e., for window 5 in 
the previous frame, this results in a set of 5 peak pairs, leading to 32 possible pitch tracks 
through the current frame. A pitch metric is used to maximize the continuity of the 
pitch track as well as the value of the ACF peaks along the pitch track to select one of 
these pitch tracks. The end point of the optimal pitch track determines the pitch period 
$p_8$ and a pitch gain $\beta_{pitch}$ for the current frame. Note that due to the position of the 
pitch windows, the pitch period and pitch gain are aligned with the right edge of the 
current frame. The pitch period is integer valued and takes on values in the range 20 - 
120. It is mapped to a 7-bit pitch index $l^*_p$ in the range of about 0-101. 

[0063] In respect to the prototype extraction module 108 and the pitch estimation 
and interpolation module 110, the pitch period is converted to the radian pitch 
frequency corresponding to the right edge of the frame by 

$$ \omega_8 = \frac{2\pi}{p_8}. \tag{24} $$

A subframe pitch frequency contour is created by linearly interpolating between the 
pitch frequency of the left edge $\omega_0$ and the pitch frequency of the right edge $\omega_8$: 

$$ \omega_m = \frac{(8 - m)\,\omega_0 + m\,\omega_8}{8}, \quad 1 \le m \le 8. \tag{25} $$

If there are abrupt discontinuities between the left edge and the right edge pitch 
frequencies, the above interpolation is modified to make a switch from the pitch 
frequency to its integer multiple or submultiple at one of the subframe boundaries. It 
should be noted that the left edge pitch frequency $\omega_0$ is the right edge pitch frequency 
of the previous frame. 

The index of the highest pitch harmonic within the 4000 Hz band is computed for 
each subframe by 

$$ K_m = \left\lfloor \frac{\pi}{\omega_m} \right\rfloor, \quad 1 \le m \le 8. \tag{26} $$
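
As an illustration of the pitch contour computation, the sketch below converts the right edge pitch period to a radian frequency, interpolates linearly across the 8 subframes, and derives the highest in-band harmonic index for each. The function name `pitch_contour` is hypothetical, and the special handling of abrupt discontinuities described above is not modelled.

```python
import math

def pitch_contour(omega_left, pitch_period_right, n_sub=8):
    """Return (contour, harmonics) for subframes m = 1..n_sub.

    omega_left: right edge pitch frequency of the previous frame (radians/sample).
    pitch_period_right: integer pitch period at the right edge of this frame.
    """
    omega_right = 2.0 * math.pi / pitch_period_right
    contour = [((n_sub - m) * omega_left + m * omega_right) / n_sub
               for m in range(1, n_sub + 1)]
    # Highest harmonic index that stays at or below pi radians (4000 Hz).
    harmonics = [int(math.pi // w) for w in contour]
    return contour, harmonics
```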



[0064] The LSFs are quantized by a hybrid scalar-vector quantization scheme. 
The first 6 LSFs are scalar quantized using a combination of intraframe and 
interframe prediction using 4 bits/LSF. The last 4 LSFs are vector quantized using 7 
bits. Thus, a total of 31 bits are used for the quantization of the 10-dimensional LSF 
vector. 

[0065] The 16 level scalar quantizers for the first 6 LSFs in a preferred 
embodiment of the present invention are designed using a Linde-Buzo-Gray algorithm. 
An LSF estimate is obtained by adding each quantizer level to a weighted 
combination of the previous quantized LSF of the current frame and the adjacent 
quantized LSFs of the previous frame: 

$$ \tilde\lambda(l,m) = \begin{cases} S_{L,m}(l) + 0.375\,\hat\lambda_{prev}(m+1), & m = 0, \\ S_{L,m}(l) + 0.375\,(\hat\lambda_{prev}(m+1) - \hat\lambda_{prev}(m-1)) + \hat\lambda(m-1), & 1 \le m \le 5, \end{cases} \quad 0 \le l \le 15. \tag{27} $$

Here, $\{\hat\lambda(m), 0 \le m < 6\}$ are the first 6 quantized LSFs of the current frame and 
$\{\hat\lambda_{prev}(m), 0 \le m < 10\}$ are the quantized LSFs of the previous frame. 
$\{S_{L,m}(l), 0 \le m < 6, 0 \le l \le 15\}$ are the 16 level scalar quantizer tables for the first 6 
LSFs. The squared distortion between the LSF and its estimate is minimized to 
determine the optimal quantizer level: 

$$ \min_{0 \le l \le 15} (\lambda(m) - \tilde\lambda(l,m))^2, \quad 0 \le m \le 5. \tag{28} $$

[0066] If $l^*_{L\_S,m}$ is the value of $l$ that minimizes the above distortion, the 
quantized LSFs are given by: 

$$ \hat\lambda(m) = \begin{cases} S_{L,m}(l^*_{L\_S,m}) + 0.375\,\hat\lambda_{prev}(m+1), & m = 0, \\ S_{L,m}(l^*_{L\_S,m}) + 0.375\,(\hat\lambda_{prev}(m+1) - \hat\lambda_{prev}(m-1)) + \hat\lambda(m-1), & 1 \le m \le 5. \end{cases} \tag{29} $$

The last 4 LSFs are vector quantized using a weighted mean squared error (WMSE) 
distortion measure. The weight vector $\{W_L(m), 6 \le m \le 9\}$ is computed by the 
following procedure: 
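$$ p1(m) = \prod_{i=0,2,4,6,8} \left\{ 4 + \cos^2(2\pi\lambda(m)) + \cos^2(2\pi\lambda(i)) - 8\cos(2\pi\lambda(m))\cos(2\pi\lambda(i)) \right\}, \quad 6 \le m \le 9. \tag{30} $$

$$ p2(m) = \prod_{i=1,3,5,7,9} \left\{ 4 + \cos^2(2\pi\lambda(m)) + \cos^2(2\pi\lambda(i)) - 8\cos(2\pi\lambda(m))\cos(2\pi\lambda(i)) \right\}, \quad 6 \le m \le 9. \tag{31} $$

$$ W_L(m) = \frac{1.09 - 0.6\cos(2\pi\lambda(m))}{(0.5 + 0.5\cos(2\pi\lambda(m)))\,p1(m) + (0.5 - 0.5\cos(2\pi\lambda(m)))\,p2(m)}, \quad 6 \le m \le 9. \tag{32} $$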



[0067] A set of predetermined mean values $\{\lambda_{dc}(m), 6 \le m \le 9\}$ are used to 
remove the DC bias in the last 4 LSFs prior to quantization. These LSFs are estimated 
based on the mean removed quantized LSFs of the previous frame: 

$$ \tilde\lambda(l,m) = V_L(l, m-6) + \lambda_{dc}(m) + 0.5\,(\hat\lambda_{prev}(m) - \lambda_{dc}(m)), \quad 0 \le l \le 127,\ 6 \le m \le 9. \tag{33} $$

[0068] Here $\{V_L(l,m), 0 \le l \le 127, 0 \le m \le 3\}$ is the 128 level, 4-dimensional 
codebook for the last 4 LSFs. The optimal code vector is determined by minimizing 
the WMSE between the estimated and the original LSF vectors: 

$$ \min_{0 \le l \le 127} \sum_{m=6}^{9} W_L(m)\,(\lambda(m) - \tilde\lambda(l,m))^2. \tag{34} $$

[0069] If $l^*_{L\_V}$ is the value of $l$ that minimizes the above distortion, the quantized 
LSF subvector is given by: 

$$ \hat\lambda(m) = V_L(l^*_{L\_V}, m-6) + \lambda_{dc}(m) + 0.5\,(\hat\lambda_{prev}(m) - \lambda_{dc}(m)), \quad 6 \le m \le 9. \tag{35} $$

[0070] The stability of the quantized LSFs is checked by ensuring that the LSFs 
are monotonically increasing and are separated by a minimum value of about 0.008. If 
this criterion is not satisfied, stability is enforced by reordering the LSFs in a 
monotonically increasing order. If a minimum separation is not achieved, the most 
recent stable quantized LSF vector from a previous frame is substituted for the 
unstable LSF vector. The six 4-bit SQ indices $\{l^*_{L\_S,m}, 0 \le m \le 5\}$ and the 7-bit VQ 
index $l^*_{L\_V}$ are transmitted to the decoder. Thus the LSFs are encoded using a total of 
31 bits. 
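
The stability check on the quantized LSFs can be sketched as follows. The function name `stabilize_lsfs` and its interface are hypothetical; the logic follows the two stages described above, reordering first and then falling back to the most recent stable vector if the minimum separation is still not met.

```python
def stabilize_lsfs(lsfs, last_stable, min_sep=0.008):
    """Return a stable LSF vector.

    lsfs: quantized LSFs of the current frame.
    last_stable: most recent stable quantized LSF vector from a previous frame.
    """
    # Enforce monotonic ordering by sorting.
    ordered = sorted(lsfs)
    # Check the minimum separation between adjacent LSFs.
    separated = all(b - a >= min_sep for a, b in zip(ordered, ordered[1:]))
    return ordered if separated else list(last_stable)
```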

[0071] The inverse quantized LSFs are interpolated each subframe, preferably by 
linear interpolation between the current LSFs $\{\hat\lambda(m), 0 \le m < 10\}$ and the previous 
LSFs $\{\hat\lambda_{prev}(m), 0 \le m < 10\}$. The interpolated LSFs at each subframe are converted to 
LP parameters $\{a_m(l), 0 \le m \le 10, 1 \le l \le 8\}$. 

[0072] The prediction residual signal for the current frame is computed using the 
noise reduced speech signal $\{s_{nr}(n)\}$ and the interpolated LP parameters. The residual is 
computed from the midpoint of a subframe to the midpoint of the next subframe, 
using the interpolated LP parameters corresponding to the center of this interval. This 
ensures that the residual is computed using locally optimal LP parameters. The 
residual for the past data as shown in FIG. 2 is preserved from the previous frame and 
is also used for PW extraction. 

[0073] Further, residual computation extends 93 samples into the look-ahead part 
of the buffer to facilitate PW extraction. LP parameters of the last subframe are used 
in computing the look-ahead part of the residual. By denoting the interpolated LP 
parameters for the j-th subframe ($0 \le j \le 8$) of the current frame by 
$\{a_m(j), 0 \le m \le 10\}$, residual computation can be represented by: 

$$ e_{lp}(n) = \begin{cases} \sum_{m=0}^{10} s_{nr}(n-m)\,a_m(0), & 80 \le n < 90, \\ \sum_{m=0}^{10} s_{nr}(n-m)\,a_m(j), & 20j + 70 \le n < 20j + 90,\ 1 \le j \le 7, \\ \sum_{m=0}^{10} s_{nr}(n-m)\,a_m(8), & 230 \le n \le 332. \end{cases} \tag{36} $$

The residual for past data, $\{e_{lp}(n), 0 \le n < 80\}$, is preserved from the previous frame. 
[0074] The invention will now be discussed in reference to PW extraction. 
The prototype waveform in the time domain is essentially the waveform of a single 
pitch cycle, which contains information about the characteristics of the glottal 
excitation. A sequence of PWs contains information about the manner in which the 
excitation is changing across the frame. A time-domain PW is obtained for each 
subframe by extracting a pitch period long segment approximately centered at each 
subframe boundary. The segment is centered with an offset of up to ±10 samples 
relative to the subframe boundary, so that the segment edges occur at low energy 
regions of the pitch cycle. This minimizes discontinuities between adjacent PWs. For 
the m-th subframe, the following region of the residual waveform is considered to 
extract the PW: 

$$ \left\{ e_{lp}(80 + 20m + n),\ -\frac{p_m}{2} - 12 \le n \le \frac{p_m}{2} + 12 \right\}, \tag{37} $$

where $p_m$ is the interpolated pitch period (in samples) for the m-th subframe. The PW 
is selected from within the above region of the residual, so as to minimize the sum of 
the energies at the beginning and at the end of the PW. The energies are computed as 
sums of squares within a 5-point window centered at each end point of the PW, as the 
center of the PW ranges over the center offset of about ±10 samples: 

$$ E_{end}(i) = \sum_{j=-2}^{2} e_{lp}^2\left(80 + 20m - \frac{p_m}{2} + i + j\right) + \sum_{j=-2}^{2} e_{lp}^2\left(80 + 20m + \frac{p_m}{2} + i + j\right), \quad -10 \le i \le 10. \tag{38} $$

[0075] The center offset resulting in the smallest energy sum determines the PW. 
If $i_{min}(m)$ is the center offset at which the segment end energy is minimized, i.e., 

$$ E_{end}(i_{min}(m)) \le E_{end}(i), \quad -10 \le i \le 10, \tag{39} $$

the time-domain PW vector for the m-th subframe is 
$\{ e_{lp}(80 + 20m - \frac{p_m}{2} + i_{min}(m) + n),\ 0 \le n < p_m \}$. This is transformed by a $p_m$-point 
discrete Fourier transform (DFT) into a complex valued frequency-domain PW 
vector: 

$$ P'_m(k) = \sum_{n=0}^{p_m - 1} e_{lp}\left(80 + 20m - \frac{p_m}{2} + i_{min}(m) + n\right) e^{-j\omega_m k n}, \quad 0 \le k \le K_m. \tag{40} $$

Here $\omega_m$ is the radian pitch frequency and $K_m$ is the highest in-band harmonic index 
for the m-th subframe (see equation 26). The frequency domain PW is used in all 
subsequent operations in the encoder. The above PW extraction process is carried out 
for each of the 8 subframes within the current frame, so that the residual signal in the 
current frame is characterized by the complex PW vector sequence 
$\{P'_m(k), 0 \le k \le K_m, 1 \le m \le 8\}$. In addition, an approximate PW is computed for 
subframe 1 of the look ahead frame, to facilitate a 3-point smoothing of PW gain and 
magnitude. Since the pitch period is not available for the look-ahead part of the 
buffer, the pitch period at the end of the current frame, i.e., $p_8$, is used in extracting 
this PW. The region of the residual used to extract this extra PW is 

$$ \left\{ e_{lp}(260 + n),\ -\frac{p_8}{2} - 12 \le n \le \frac{p_8}{2} + 12 \right\}. \tag{41} $$

[0076] By minimizing the end energy sum as before, the time-domain PW vector 
is obtained as $\{ e_{lp}(260 - \frac{p_8}{2} + i_{min}(9) + n),\ 0 \le n < p_8 \}$. The frequency-domain PW 
vector is designated by $P'_9$ and is computed by the following DFT: 

$$ P'_9(k) = \sum_{n=0}^{p_8 - 1} e_{lp}\left(260 - \frac{p_8}{2} + i_{min}(9) + n\right) e^{-j\omega_8 k n}, \quad 0 \le k \le K_8. \tag{42} $$

It should be noted that the approximate PW is only used for smoothing operations and 
not as the PW for subframe 1 during the encoding of the next frame. Rather, it is 
replaced by the exact PW computed during the next frame. 
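
The PW extraction procedure can be sketched as below: search the center offset that minimizes the end energies, cut out one pitch cycle, and evaluate the DFT at the pitch harmonics. The function name `extract_pw` is hypothetical, integer division is assumed for the half pitch period, and the residual buffer is assumed long enough that no index falls outside it.

```python
import cmath
import math

def extract_pw(residual, center, period, omega, max_offset=10, half_win=2):
    """Return (best_offset, spectrum) for one subframe.

    center: nominal PW center, e.g. 80 + 20*m for subframe m.
    period: pitch period in samples; omega: radian pitch frequency.
    """
    half = period // 2

    def end_energy(i):
        # Sum of squares in 5-point windows around both segment end points.
        start = center - half + i
        end = center + half + i
        return sum(residual[start + j] ** 2 + residual[end + j] ** 2
                   for j in range(-half_win, half_win + 1))

    best = min(range(-max_offset, max_offset + 1), key=end_energy)
    start = center - half + best
    segment = residual[start:start + period]
    k_max = int(math.pi // omega)
    spectrum = [sum(x * cmath.exp(-1j * omega * k * n) for n, x in enumerate(segment))
                for k in range(k_max + 1)]
    return best, spectrum
```

When several offsets give the same minimum energy, this sketch keeps the first one found; the source does not specify a tie-breaking rule.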
[0077] Each complex PW vector can be further decomposed into a scalar gain 
component representing the level of the PW vector and a normalized complex PW 
vector representing the shape of the PW vector. Such a decomposition permits vector 
quantization that is efficient in terms of computation and storage with minimal 
degradation in quantization performance. The PW gain is the root-mean square 
(RMS) value of the complex PW vector. It is obtained by 

$$ g'_{pw}(m) = \sqrt{\frac{1}{2K_m + 2} \sum_{k=0}^{K_m} |P'_m(k)|^2}, \quad 1 \le m \le 8. \tag{43} $$

[0078] PW gain is also computed for the extra PW by 

$$ g'_{pw}(9) = \sqrt{\frac{1}{2K_8 + 2} \sum_{k=0}^{K_8} |P'_9(k)|^2}. \tag{44} $$

[0079] A normalized PW vector sequence is obtained by dividing the PW vectors 
by the corresponding gains: 

$$ P_m(k) = \frac{P'_m(k)}{g'_{pw}(m)}, \quad 0 \le k \le K_m,\ 1 \le m \le 8. \tag{45} $$

And for the extra PW: 

$$ P_9(k) = \frac{P'_9(k)}{g'_{pw}(9)}, \quad 0 \le k \le K_8. \tag{46} $$


[0080] For a majority of frames, especially during stationary intervals, gain values 
change slowly from one subframe to the next. This makes it possible to decimate the 
gain sequence by a factor of about 2, thereby reducing the number of values that need 
to be quantized. Prior to decimation, the gain sequence is smoothed by a 3-point 
window, to eliminate excessive variations across the frame. The smoothing operation 
is in the logarithmic gain domain and is represented by 

$$ g''_{pw}(m) = 0.3 \log_{10} g'_{pw}(m-1) + 0.4 \log_{10} g'_{pw}(m) + 0.3 \log_{10} g'_{pw}(m+1), \quad 1 \le m \le 8. \tag{47} $$

[0081] Conversion to logarithmic domain is advantageous since it corresponds to 
the scale of loudness of sound perceived by the human ear. The smoothed gain values 
are transformed by the following transformation: 

$$ g_{pw}(m) = \begin{cases} 0, & g''_{pw}(m) > 4.5, \\ 90 - 20\,g''_{pw}(m), & 0 \le g''_{pw}(m) \le 4.5, \\ 90, & g''_{pw}(m) < 0, \end{cases} \quad 1 \le m \le 8. \tag{48} $$

[0082] This transformation limits extreme (very low or very high) values of the 
gain and thereby improves quantizer performance, especially for low-level signals. 
The transformed gains are decimated by a factor of 2, requiring that only the even 
indexed values, i.e., $\{g_{pw}(2), g_{pw}(4), g_{pw}(6), g_{pw}(8)\}$, are quantized. 

[0083] At the decoder 100B, the odd indexed values are obtained by linearly 
interpolating between the inverse quantized even indexed values. 
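
The gain processing chain (3-point log-domain smoothing, limiting transformation, decimation at the encoder, and interpolation at the decoder) can be sketched as follows. Function names are hypothetical. Two assumptions are made: the smoothing input holds 10 gains, namely the last gain of the previous frame, the 8 gains of the current frame, and the extra look-ahead gain; and the decoder interpolates the first odd value between the last even value of the previous frame and the first even value of the current frame.

```python
import math

def smooth_log_gains(gains):
    """gains[0..9] -> smoothed log gains for subframes m = 1..8."""
    logs = [math.log10(g) for g in gains]
    return [0.3 * logs[m - 1] + 0.4 * logs[m] + 0.3 * logs[m + 1]
            for m in range(1, 9)]

def transform_gain(g):
    """Limit extreme values and map the smoothed log gain to the range 0..90."""
    if g > 4.5:
        return 0.0
    if g < 0.0:
        return 90.0
    return 90.0 - 20.0 * g

def decimate(transformed):
    """Keep the even indexed subframes m = 2, 4, 6, 8 (list starts at m = 1)."""
    return transformed[1::2]

def interpolate_odd_gains(prev_frame_last, even_gains):
    """Decoder side: rebuild m = 1..8 from the four even indexed values."""
    out = []
    left = prev_frame_last
    for right in even_gains:
        out.extend([(left + right) / 2.0, right])
        left = right
    return out
```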

[0084] A 256 level, 4-dimensional vector quantizer is used to quantize the above 
gain vector. The design of the vector quantizer is one of the novel aspects of this 
algorithm. The PW gain sequence can exhibit two distinct modes of behavior. During 
stationary signals, such as voiced intervals, variations of the gain sequence across a 
frame are small. 

[0085] On the other hand, during non-stationary signals such as voicing onsets, 
the gain sequence can exhibit large variations across a frame. The vector quantizer 
used must be able to represent both types of behavior. On the average, stationary 
frames far outnumber the non-stationary frames. 

[0086] If a vector quantizer is trained using a database which does not distinguish 
between the two types, the training is dominated by stationary frames, leading to poor 
performance for non-stationary frames. To overcome this problem, the vector 
quantizer design was modified by classifying the PW gain vectors into a 
stationary class and a non-stationary class. 

[0087] For the 256 level codebook, 192 levels were allocated to represent 
stationary frames and the remaining 64 were allocated for non-stationary frames. The 
192 level codebook is trained using the stationary frames, and the 64 level codebook 
is trained using the non-stationary frames. The training algorithm with a binary split 
and random perturbation is based on the generalized Lloyd algorithm disclosed in "An 
algorithm for Vector Quantization Design", by Y. Linde, A. Buzo and R. Gray, pages 
84-95 of IEEE Transactions on Communications, VOL. COM-28, No. 1, January 
1980 which is incorporated by reference in its entirety. In the case of the stationary 
codebook, a ternary split is used to derive the 192 level codebook from a 64 level 
codebook in the final stage of the training process. The 192 level codebook and the 64 
level codebook are concatenated to obtain the 256-level gain codebook. The 
stationary/non-stationary classification is used only during the training phase. During 
quantization, stationary/non-stationary classification is not performed. Instead, the 
entire 256-level codebook is searched to locate the optimal quantized gain vector. The 
quantizer uses a mean squared error (MSE) distortion metric: 

$$ D_g(l) = \sum_{m=1}^{4} \left[ g_{pw}(2m) - V_g(l,m) \right]^2, \quad 0 \le l \le 255, \tag{49} $$

where $\{V_g(l,m), 0 \le l \le 255, 1 \le m \le 4\}$ is the 256 level, 4-dimensional gain 
codebook and $D_g(l)$ is the MSE distortion for the l-th codevector. In another 
embodiment of the present invention the optimal codevector $\{V_g(l^*_g, m), 1 \le m \le 4\}$ is 
the one which minimizes the distortion measure over the entire codebook, i.e., 

$$ D_g(l^*_g) \le D_g(l), \quad 0 \le l \le 255. \tag{50} $$

The 8-bit index of the optimal codevector $l^*_g$ is transmitted to the decoder as the gain 
index. 
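
The exhaustive codebook search described above can be sketched as follows. The function name `search_gain_codebook` is hypothetical, and the codebook contents in any real use would come from the training procedure described earlier; here the codebook is just a list of 4-element lists.

```python
def search_gain_codebook(target, codebook):
    """Full search for the codevector with minimum squared error.

    target: the four decimated, transformed gains.
    codebook: list of candidate 4-dimensional codevectors.
    Returns (index, distortion) of the best codevector.
    """
    def distortion(codevector):
        return sum((t - c) ** 2 for t, c in zip(target, codevector))

    best = min(range(len(codebook)), key=lambda l: distortion(codebook[l]))
    return best, distortion(codebook[best])
```

Because the stationary and non-stationary codebooks are concatenated, the same search covers both classes without a classification step at encoding time.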

[0088] FIG. 5 is a block diagram showing the separation of stationary and 
nonstationary components of a PW in accordance with an embodiment of the present 
invention and occurs in compute subband nonstationary measure module 116. In the 
FDI algorithm, only the PW magnitude information is explicitly encoded. PW phase 
is not encoded explicitly since the replication of phase spectrum is not necessary for 
achieving a natural quality in reconstructed speech. However, this does not imply that 
an arbitrary phase spectrum can be employed at the decoder. One important 
requirement on the phase spectrum used at the decoder 100B is that it produces the 
correct degree of periodicity, i.e., pitch cycle stationarity, across the frequency band. 
Achieving the correct degree of periodicity is extremely important to reproduce 
natural sounding speech. 

[0089] The generation of the phase spectrum at the decoder 100B is facilitated by 
measuring pitch cycle stationarity at the encoder as a ratio of the energy of the 
nonstationary component to that of the stationary component in the PW sequence. 
Further, this energy ratio is measured over 5 subbands spanning the frequency band of 
interest, resulting in a 5-dimensional vector nonstationarity measure in each frame. 
This vector is quantized and transmitted to the decoder, where it is used to generate 
phase spectra that lead to the correct degree of periodicity across the band. The first 
step in measuring the stationarity of PW is to align the PW sequence. 
[0090] In order to measure the degree of stationarity of the PW sequence, it is 
necessary to align each PW to the preceding PW. The alignment process applies a 
circular shift to the pitch cycle to remove apparent differences in adjacent PWs that 
are due to temporal shifts or variations in pitch frequency. Let $\tilde{P}_{m-1}$ denote the aligned 
PW corresponding to subframe $m-1$ and let $\tilde\theta_{m-1}$ be the phase shift that was applied 
to $P_{m-1}$ to derive $\tilde{P}_{m-1}$. In other words, 

$$ \tilde{P}_{m-1}(k) = P_{m-1}(k)\, e^{j\tilde\theta_{m-1} k}, \quad 0 \le k \le K_{m-1}. \tag{51} $$

[0091] For the alignment of $P_m$ to $\tilde{P}_{m-1}$, if the residual signal is perfectly periodic 
with the pitch period being an integer number of samples, $P_m$ and $P_{m-1}$ are identical 
except for a circular shift. In this case, the pitch cycle for the m-th subframe is 
identical to the pitch cycle for the (m-1)-th subframe, except that the starting point for 
the former is at a later point in the pitch cycle compared to the latter. The difference in 
starting point arises due to the advance by a subframe interval and differences in 
center offsets at subframes $m$ and $m-1$. With the subframe interval of 20 samples 
and with center offsets of $i_{min}(m)$ and $i_{min}(m-1)$, it can be seen that the m-th pitch 
cycle is ahead of the (m-1)-th pitch cycle by $20 + i_{min}(m) - i_{min}(m-1)$ samples. If the 
pitch frequency is $\omega_m$, a phase shift of $-\omega_m (20 + i_{min}(m) - i_{min}(m-1))$ is necessary 
to correct for this phase difference and align $P_m$ with $P_{m-1}$. In addition, since $P_{m-1}$ has 
been circularly shifted by $\tilde\theta_{m-1}$ to derive $\tilde{P}_{m-1}$, it follows that the phase shift needed to 
align $P_m$ with $\tilde{P}_{m-1}$ is a sum of these two phase shifts and is given by 

$$ \tilde\theta_{m-1} - \omega_m (20 + i_{min}(m) - i_{min}(m-1)). \tag{52} $$

[0092] In practice, the residual signal is not perfectly periodic and the pitch period 
can be non-integer valued. In such a case, the above cannot be used as the phase shift 
for optimal alignment. However, for quasi-periodic signals, the above phase angle can 
be used as a nominal shift and a small range of angles around this nominal shift angle 
is evaluated to find a locally optimal shift angle. Satisfactory results have been 
obtained with an angle range of about $\pm 0.2\pi$ centered around the nominal shift 
angle, searched in steps of about $0.04\pi$. For each shift within this range, the shifted 
version of $P_m$ is correlated against $\tilde{P}_{m-1}$. The shift angle that results in the maximum 
correlation is selected as the locally optimal shift. This correlation maximization can 
be represented by 

$$ \max_{-5 \le i \le 5} \sum_{k=0}^{K_m} \mathrm{Re}\left[ \tilde{P}_{m-1}(k)\, P^*_m(k)\, e^{-jk\left(\tilde\theta_{m-1} - \omega_m (20 + i_{min}(m) - i_{min}(m-1)) + 0.04\pi i\right)} \right], \tag{53} $$

where * represents complex conjugation and Re[ ] is the real part of a complex 
vector. If $i = i_{max}$ maximizes the above correlation, then the locally optimal shift angle 
is 

$$ \tilde\theta_m = \tilde\theta_{m-1} - \omega_m (20 + i_{min}(m) - i_{min}(m-1)) + 0.04\pi\, i_{max} \tag{54} $$

and the aligned PW for the m-th subframe is obtained from 

$$ \tilde{P}_m(k) = P_m(k)\, e^{j\tilde\theta_m k}, \quad 0 \le k \le K_m. \tag{55} $$
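
The local search for the optimal alignment angle can be sketched as follows. The function name `align_pw` is hypothetical, the nominal shift is passed in already computed, and summing the correlation only over the harmonics common to both PWs is an assumption made here so that vectors of different lengths can be compared.

```python
import cmath
import math

def align_pw(prev_aligned, cur, nominal_shift, step=0.04 * math.pi, n_steps=5):
    """Return (theta, aligned) for the current PW.

    prev_aligned: aligned PW of the previous subframe (complex harmonics).
    cur: unaligned PW of the current subframe.
    nominal_shift: nominal phase shift in radians per harmonic index.
    """
    n = min(len(prev_aligned), len(cur))

    def correlation(theta):
        # Real part of prev(k) times the conjugate of the shifted current PW.
        return sum((prev_aligned[k] * (cur[k] * cmath.exp(1j * theta * k)).conjugate()).real
                   for k in range(n))

    best_i = max(range(-n_steps, n_steps + 1),
                 key=lambda i: correlation(nominal_shift + i * step))
    theta = nominal_shift + best_i * step
    aligned = [c * cmath.exp(1j * theta * k) for k, c in enumerate(cur)]
    return theta, aligned
```

With the default step and range, the search covers 11 candidate angles spanning ±0.2π around the nominal shift.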
[0093] The process of alignment results in a sequence of aligned PWs from which 
any apparent dissimilarities due to shifts in the PW extraction window, pitch period 
etc. have been removed. Only dissimilarities due to the shape of the pitch cycle or, 
equivalently, the residual spectral characteristics are preserved. Thus, the sequence of 
aligned PWs provides a means of measuring the degree of change taking place in the 
residual spectral characteristics, i.e., the degree of stationarity of the residual spectral 
characteristics. The basic premise of the FDI algorithm is that it is important to 
encode and reproduce the degree of stationarity of the residual in order to produce 
natural sounding speech at the decoder. Consider the temporal sequence of aligned 
PWs along the k-th harmonic track, i.e., 

$$ \{ \tilde{P}_m(k),\ 1 \le m \le 8 \}. \tag{56} $$

[0094] If the signal is perfectly periodic, the k-th harmonic is identical for all 
subframes, and the above sequence is a constant as a function of m. If the signal is 
quasi-periodic, the sequence exhibits slow variations across the frame, but is still a 
predominantly low frequency waveform. It should be noted that here frequency refers 
to evolutionary frequency, related to the rate at which PW changes across a frame. 
This is in contrast to harmonic frequency, which is the frequency of the pitch 
harmonic. Thus, a high frequency harmonic component changing slowly across the 
frame is said to have low evolutionary frequency content. Or a low frequency 
harmonic component changing rapidly across the frame is said to have high 
evolutionary frequency content. 

[0095] As the signal periodicity decreases, variations in the above PW sequence 
increase, with decreasing energy at lower frequencies and increasing energy at higher 
frequencies. At the other extreme, if the signal is aperiodic, the PW sequence exhibits 
large variations across the frame, with a near uniform energy distribution across 
frequency. Thus, by determining the spectral energy distribution of aligned PW 
sequences along a harmonic track, it is possible to obtain a measure of the periodicity 
of the signal at that harmonic frequency. By repeating this analysis at all the 
harmonics within the band of interest, a frequency dependent measure of periodicity 
can be constructed. 

[0096] The relative distribution of spectral energy of variations of PW between 
low and high frequencies can be determined by passing the aligned PW sequence 
along each harmonic track through a low pass filter and a high pass filter. In an 
embodiment of the present invention, the low pass filter used is a 3rd order Chebyshev 
filter with a 3 dB cutoff at 35 Hz (for the PW sampling frequency of 400 Hz), with the 
following transfer function: 

$$ H_{lpf2}(z) = \frac{0.063536 - 0.039167 z^{-1} - 0.039167 z^{-2} + 0.063536 z^{-3}}{1 - 2.2255 z^{-1} + 1.7265 z^{-2} - 0.45231 z^{-3}}. \tag{57} $$

The high pass filter used is also a 3rd order Chebyshev filter with a 3 dB cutoff at 18 
Hz with the following transfer function: 

$$ H_{hpf}(z) = \frac{0.71923 - 2.1146 z^{-1} + 2.1146 z^{-2} - 0.71923 z^{-3}}{1 - 2.2963 z^{-1} + 1.8542 z^{-2} - 0.51726 z^{-3}}. \tag{58} $$

[0097] The output of the low pass filter is the stationary component of the PW 
that gives rise to pitch cycle periodicity and is denoted by 
$\{S_m(k), 0 \le k \le K_m, 1 \le m \le 8\}$. The output of the high pass filter is the nonstationary 
component of PW that gives rise to pitch cycle aperiodicity and is denoted by 
$\{R_m(k), 0 \le k \le K_m, 1 \le m \le 8\}$. The energies of these components are computed in 
subbands and then averaged across the frame. 

[0098] The harmonics of the stationary and nonstationary components are 
grouped into 5 subbands spanning the frequency band of interest, where the band 
edges in Hz are defined by the array 

$$ B_{pw} = [\,1 \quad 400 \quad 800 \quad 1600 \quad 2400 \quad 3400\,]. \tag{59} $$

The subband edges in Hz can be translated to subband edges in terms of harmonic 
indices such that the i-th subband contains harmonics with indices 
$\{\eta_m(i-1) \le k < \eta_m(i),\ 1 \le i \le 5\}$ as follows: 



[Equation (60): piecewise definition of the harmonic band edges $\eta_m(i)$ in terms of $\lfloor B_{pw}(i)\,\pi / (4000\,\omega_m) \rfloor$, $0 \le i \le 5$, $1 \le m \le 8$.] 



The energy in each subband is computed by averaging the squared magnitude of each 
harmonic within the subband. For the stationary component, the subband energy 
distribution for the m-th subframe is computed by 

$$ ES_m(l) = \frac{1}{2\,(\eta_m(l) - \eta_m(l-1))} \sum_{k=\eta_m(l-1)}^{\eta_m(l)-1} |S_m(k)|^2, \quad 1 \le l \le 5. \tag{61} $$

For the nonstationary component, the subband energy distribution for the 
m-th subframe is computed by 

$$ ER_m(l) = \frac{1}{2\,(\eta_m(l) - \eta_m(l-1))} \sum_{k=\eta_m(l-1)}^{\eta_m(l)-1} |R_m(k)|^2, \quad 1 \le l \le 5. \tag{62} $$

Next, these subframe energies are averaged across the frame: 

$$ ES_{avg}(l) = \frac{1}{8} \sum_{m=1}^{8} ES_m(l), \quad 1 \le l \le 5, \tag{63} $$

$$ ER_{avg}(l) = \frac{1}{8} \sum_{m=1}^{8} ER_m(l), \quad 1 \le l \le 5. \tag{64} $$

The subband nonstationarity measure is computed as the ratio of the energy of the 
nonstationary component to that of the stationary component in each subband: 

$$ \Re(l) = \frac{ER_{avg}(l)}{ES_{avg}(l)}, \quad 1 \le l \le 5. \tag{65} $$
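
The computation of the subband nonstationarity measure can be sketched as below. Function names and the nested-list layout are hypothetical, the harmonic band edges are passed in already computed, each band is assumed to contain at least one harmonic, and the stationary energy is assumed to be nonzero. Any constant scale factor in the subband energies cancels in the final ratio.

```python
def subband_energies(component, edges):
    """Average squared magnitude per subband for one subframe.

    component: complex harmonics for the subframe.
    edges: harmonic band edges; band l covers edges[l] <= k < edges[l + 1].
    """
    return [sum(abs(component[k]) ** 2 for k in range(lo, hi)) / (2.0 * (hi - lo))
            for lo, hi in zip(edges, edges[1:])]

def nonstationarity_measure(stationary, nonstationary, edges):
    """Ratio of frame-averaged nonstationary to stationary energy per subband.

    stationary, nonstationary: one list of complex harmonics per subframe.
    edges: one list of band edges per subframe.
    """
    n_sub = len(stationary)
    n_bands = len(edges[0]) - 1
    es_avg = [0.0] * n_bands
    er_avg = [0.0] * n_bands
    for m in range(n_sub):
        es = subband_energies(stationary[m], edges[m])
        er = subband_energies(nonstationary[m], edges[m])
        for l in range(n_bands):
            es_avg[l] += es[l] / n_sub
            er_avg[l] += er[l] / n_sub
    return [er / es for er, es in zip(er_avg, es_avg)]
```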



[0099] If this ratio is very low, it indicates that the PW sequence has much higher 
energy at low evolutionary frequencies than at high evolutionary frequencies, 
corresponding to a predominantly periodic signal or stationary PW sequence. On the 
other hand, if this ratio is very high, it indicates that the PW sequence has much 
higher energy at high evolutionary frequencies than at low evolutionary frequencies, 
corresponding to a predominantly aperiodic signal or nonstationary PW sequence. 
Intermediate values of the ratio indicate different mixtures of periodic and aperiodic 
components in the signal or different degrees of stationarity of the PW sequence. This 
information can be used at the decoder to create the correct degree of variation from 
one PW to the next, as a function of frequency and thereby realize the correct degree 
of periodicity in the signal. 

[00100] In the case of nonstationary voiced signals, where the pitch cycle is changing 
rapidly across the frame, the nonstationarity measure may have high values even in 
low frequency bands. This is usually a characteristic of unvoiced signals and usually 
translates to a noise-like excitation at the decoder. However, it is important that 
nonstationary voiced frames are reconstructed at the decoder with glottal pulse-like 
excitation rather than with noise-like excitation. This information is conveyed by a 
scalar parameter called a voicing measure, which is a measure of the degree of 
voicing of the frame. During stationary voiced and unvoiced frames, there is some 
correlation between the nonstationarity measure and the voicing measure. However, 
while the voicing measure indicates whether the excitation pulse should be a glottal 
pulse or a noise-like waveform, the nonstationarity measure indicates how much this 
excitation pulse should change from subframe to subframe. The correlation between 
the voicing measure and the nonstationarity measure is exploited by vector quantizing 
these jointly. 

[00101] The voicing measure is estimated for each frame based on certain 
characteristics correlated with the voiced/unvoiced nature of the frame. It is a 
heuristic measure that assigns a degree of voicing to each frame in the range 0-1, with 
a zero indicating a perfectly voiced frame and a one indicating a completely unvoiced 
frame. 

[00102] The voicing measure is determined based on six measured characteristics 
of the current frame: the average of the nonstationarity measure in the 3 low 
frequency subbands; a relative signal power, which is computed as the difference 
between the signal power of the current frame and a long term average signal power; 
the pitch gain; the average correlation between adjacent aligned PWs; the 1st 
reflection coefficient obtained during LP analysis; and the variance of the candidate 
pitch lags computed during pitch estimation. 

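[00103] The (squared) normalized correlation between the aligned PWs of the m-th 
and (m-1)-th subframes is obtained by 

$$\gamma_m = \frac{\left| \sum_{k \le 6} P_m(k) P^*_{m-1}(k) \right|^2}{\sum_{k \le 6} |P_m(k)|^2 \; \sum_{k \le 6} |P_{m-1}(k)|^2} \qquad (66)$$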



[00104] It should be noted that the upper limit of the summations is limited to 6 
rather than $K_m$ to reduce computational complexity. This subframe correlation is 
averaged across the frame to obtain an average PW correlation: 

$$\gamma_{avg} = \frac{1}{8} \sum_{m=1}^{8} \gamma_m. \qquad (67)$$



The average PW correlation is a measure of pitch cycle to pitch cycle correlation after 
variations due to signal level, pitch period and PW extraction offset have been 
removed. It exhibits a strong correlation to the nature of glottal excitation. 

As mentioned earlier, the nonstationarity measure, especially in the low frequency 
subbands, has a strong correlation to the voicing of the frame. An average of the 
nonstationarity measure for the 3 lowest subbands provides a useful parameter in 
inferring the nature of the glottal excitation. This average is computed as 

$$\Re_{avg} = \frac{1}{3} \sum_{l=1}^{3} \Re(l). \qquad (68)$$

It will be appreciated by those skilled in the art that subbands other than the three 
lowest subbands can be used without departing from the scope of the present 
invention. 

[00105] The pitch gain is a parameter that is computed as part of the pitch analysis 
function. It is essentially the value of the peak of the autocorrelation function (ACF) 
of the residual signal at the pitch lag. To avoid spurious peaks, the ACF used in the 
embodiment of this invention is a composite autocorrelation function, computed as a 
weighted average of adjacent residual raw autocorrelation functions. 

[00106] The pitch gain, denoted by $\beta_{pitch}$, is the value of the peak of a 
composite autocorrelation function. The composite ACFs are evaluated once every 40 
samples within each frame, at 80, 120, 160, 200 and 240 samples, as shown in FIG. 2. 
For each of the 5 ACFs, the location of the peak is selected as a candidate pitch 
period. The variation among these 5 candidate pitch lags is also a measure of the 
voicing of the frame. For unvoiced frames, these values exhibit a higher variance than 
for voiced frames. The mean is computed as 

$$p\_cand_{avg} = \frac{1}{5} \sum_{i=0}^{4} p\_cand_i. \qquad (69)$$

The variation is computed as the average of the absolute deviations from this mean: 

$$p_{var} = \frac{1}{5} \sum_{i=0}^{4} \left| p\_cand_{avg} - p\_cand_i \right|. \qquad (70)$$

This parameter exhibits a moderate degree of correlation to the voicing of the signal. 
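
As a small worked sketch of the mean and the average absolute deviation of the candidate pitch lags described above: the function name is hypothetical, and no further normalization is applied because none can be recovered from the source.

```python
def pitch_lag_variation(candidates):
    """Return (mean, average absolute deviation) of the candidate pitch lags."""
    mean = sum(candidates) / len(candidates)
    variation = sum(abs(mean - p) for p in candidates) / len(candidates)
    return mean, variation
```

Five identical candidates give a variation of zero, as expected for a steady voiced frame, while scattered candidates give a larger value.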

[00107] The signal power also exhibits a moderate degree of correlation to the 
voicing of the signal. However, it is important to use a relative signal power rather 
than an absolute signal power, to achieve robustness to input signal level deviations 
from nominal values. The signal power in dB is defined as 

$$E_{sig} = 10 \log_{10} \left[ \frac{1}{80} \sum_{n=160}^{239} s^2(n) \right]. \qquad (71)$$



[00108] An average signal power can be obtained by exponentially averaging the 
signal power during active frames. Such an average can be computed recursively 
using the following equation: 

$$E_{sigavg} = 0.95 E_{sigavg} + 0.05 E_{sig}. \qquad (72)$$

[00109] A relative signal power can be obtained as the difference between the 
signal power and the average signal power: 

$$E_{sigrel} = E_{sig} - E_{sigavg}. \qquad (73)$$

[00110] The relative signal power measures the signal power of the frame relative to 
a long term average. Voiced frames exhibit moderate to high values of relative signal 
power, whereas unvoiced frames exhibit low values. 
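
The recursion for the average signal power and the resulting relative power can be sketched as below. The class and function names, the initial average, the small floor inside the logarithm and the choice to update the average before taking the difference are assumptions made for this illustration only.

```python
import math


def signal_power_db(samples):
    """Mean-square power of a block of samples, in dB."""
    mean_sq = sum(x * x for x in samples) / len(samples)
    # The floor avoids log10(0); it is an assumption, not from the source.
    return 10.0 * math.log10(max(mean_sq, 1e-12))


class RelativePower:
    """Tracks the exponentially averaged power and returns the relative power."""

    def __init__(self, initial_avg_db):
        self.avg_db = initial_avg_db

    def update(self, e_sig_db, active=True):
        if active:  # the average is updated only during active frames
            self.avg_db = 0.95 * self.avg_db + 0.05 * e_sig_db
        return e_sig_db - self.avg_db
```

With a long term average of 60 dB, an active 80 dB frame moves the average to 61 dB and yields a relative power of 19 dB.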

[00111] The 1st reflection coefficient $\rho_1$ is obtained as a byproduct of LP 
analysis during the Levinson-Durbin recursion. Conceptually, it is equivalent to the 
1st order normalized autocorrelation coefficient of the noise reduced speech. During 
voiced speech segments, the speech spectrum tends to have a low pass characteristic, 
which results in a $\rho_1$ close to 1. During unvoiced frames, the speech spectrum 
tends to have a flatter or high pass characteristic, resulting in smaller or even 
negative values for $\rho_1$. 

[00112] To derive the voicing measure, each of these six parameters is 
nonlinearly transformed using sigmoidal functions such that it maps to the range 0-1, 
close to 0 for voiced frames and close to 1 for unvoiced frames. The parameters for 
the sigmoidal transformations have been selected based on an analysis of the 
distribution of these parameters. The following are the transformations for each of 
these parameters: 

[Equations (74)-(78) and one further unnumbered equation: the transformations of the 
pitch gain, the average PW correlation, the low-band nonstationarity average, the 
relative signal power, the pitch variation and the 1st reflection coefficient into 
$n_{pg}$, $n_{pw}$, $n_R$, $n_E$, $n_{pv}$ and $n_\rho$, respectively. Each is built 
from terms of the form $1/(1 + e^{-a(x-b)})$, except (78), which is piecewise linear in 
$p_{var}$ with breakpoints at 0.02 and 0.07 and equals 1 for $p_{var} \ge 0.07$. The 
individual expressions are not legible in the source.] 



The voicing measure of the previous frame, $v_{prev}$, selects the weights of the 
weighted sum of the transformed parameters that yields the voicing measure: 

$$v = \begin{cases} 0.35n_{pg} + 0.225n_{pw} + 0.15n_R + 0.085n_E + 0.07n_{pv} + 0.12n_\rho, & v_{prev} < 0.3 \\ 0.35n_{pg} + 0.2n_{pw} + 0.1n_R + 0.1n_E + 0.05n_{pv} + 0.2n_\rho, & v_{prev} \ge 0.3 \end{cases} \qquad (79)$$

[00113] The weights used in the above sum are in accordance with the degree of 
correlation of the parameter to the voicing of the signal. Thus, the pitch gain receives 
the highest weight since it is most strongly correlated, followed by the PW 
correlation. The 1st reflection coefficient and low-band nonstationarity measure 
receive moderate weights. The weights also depend on whether the previous frame 
was strongly voiced, in which case more weight is given to the low-band 
nonstationarity measure. The pitch variation and relative signal power receive smaller 
weights since they are only moderately correlated to voicing. 

[00114] If the resulting voicing measure $v$ is clearly in the voiced region 
($v < 0.45$) or clearly in the unvoiced region ($v > 0.6$), it is not modified further. 
However, if it lies outside the clearly voiced or unvoiced regions, the parameters are 
examined to determine if there is a moderate bias towards a voiced frame. In such a 
case, the voicing measure is modified so that its value lies in the voiced region. 

[00115] The resulting voicing measure $v$ takes on values in the range 0-1, with 
lower values for more voiced signals. In addition, a binary voicing measure flag is 
derived from the voicing measure as follows: 

$$v_{flag} = \begin{cases} 0, & v \le 0.45, \\ 1, & v > 0.45. \end{cases} \qquad (80)$$

[00116] Thus, $v_{flag}$ is 0 for voiced signals and 1 for unvoiced signals. This flag 
is used in selecting the quantization mode for the PW magnitude and the subband 
nonstationarity vector. The voicing measure $v$ is concatenated to the subband 
nonstationarity measure vector and the resulting 6-dimensional vector is vector 
quantized. 
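
The weighted sum and the binary flag can be sketched as follows. The weights are as read from equation (79), with the two partially illegible relative-power weights taken as 0.085 and 0.1 so that each weight set sums to one; that reading, the function names and the omission of the bias adjustment applied to intermediate values of the voicing measure are assumptions of this sketch.

```python
# Weight order: pitch gain, PW correlation, low-band nonstationarity,
# relative signal power, pitch variation, 1st reflection coefficient.
WEIGHTS_PREV_STRONGLY_VOICED = (0.35, 0.225, 0.15, 0.085, 0.07, 0.12)
WEIGHTS_OTHERWISE = (0.35, 0.2, 0.1, 0.1, 0.05, 0.2)


def voicing_measure(n_pg, n_pw, n_r, n_e, n_pv, n_rho, v_prev):
    """Weighted sum of the six transformed parameters (each in 0..1)."""
    weights = WEIGHTS_PREV_STRONGLY_VOICED if v_prev < 0.3 else WEIGHTS_OTHERWISE
    params = (n_pg, n_pw, n_r, n_e, n_pv, n_rho)
    return sum(w * n for w, n in zip(weights, params))


def voicing_flag(v):
    """0 for voiced frames, 1 for unvoiced frames."""
    return 0 if v <= 0.45 else 1
```

Because each weight set sums to one, transformed parameters in the range 0-1 always produce a voicing measure in the same range.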

[00117] The subband nonstationarity measure can have occasional spurious large 
values, mainly due to the approximations and the averaging used during its 
computation. If this occurs during voiced frames, the signal is reproduced with 
excessive roughness and the voice quality is degraded. To prevent this, large values of 
the nonstationarity measure are attenuated. The attenuation characteristic has been 
determined experimentally and is specified as follows for each of the five subbands: 

$$\Re(1) \Leftarrow \begin{cases} \Re(1), & v > 0.6 \text{ or } \Re(1) < 0.3 + 0.1667v \\ 0.05 + 0.1667v + \dfrac{0.5}{1 + e^{-5(\Re(1) - 0.3 - 0.1667v)}}, & v \le 0.6 \text{ and } \Re(1) \ge 0.3 + 0.1667v \end{cases} \qquad (81)$$

$$\Re(2) \Leftarrow \begin{cases} \Re(2), & v > 0.6 \text{ or } \Re(2) < 0.45 + 0.1667v \\ 0.2 + 0.0833v + \dfrac{0.5 + 0.1667v}{1 + e^{-5(\Re(2) - 0.45 - 0.1667v)}}, & v \le 0.6 \text{ and } \Re(2) \ge 0.45 + 0.1667v \end{cases} \qquad (82)$$

$$\Re(3) \Leftarrow \begin{cases} \Re(3), & v > 0.6 \text{ or } \Re(3) < 0.5 + 0.5v \\ 0.1 + 0.5v + \dfrac{0.8}{1 + e^{-5(\Re(3) - 0.5 - 0.5v)}}, & v \le 0.6 \text{ and } \Re(3) \ge 0.5 + 0.5v \end{cases} \qquad (83)$$

$$\Re(4) \Leftarrow \begin{cases} \Re(4), & v > 0.6 \text{ or } \Re(4) < 0.65 + 0.5833v \\ 0.3 + 0.3333v + \dfrac{0.7 + 0.5v}{1 + e^{-5(\Re(4) - 0.3 - 0.3333v)}}, & v \le 0.6 \text{ and } \Re(4) \ge 0.65 + 0.5833v \end{cases} \qquad (84)$$

$$\Re(5) \Leftarrow \begin{cases} \Re(5), & v > 0.6 \text{ or } \Re(5) < 0.65 + 0.5833v \\ 0.3 + 0.3333v + \dfrac{0.7 + 0.5v}{1 + e^{-5(\Re(5) - 0.3 - 0.3333v)}}, & v \le 0.6 \text{ and } \Re(5) \ge 0.65 + 0.5833v \end{cases} \qquad (85)$$

[00118] Additionally, for voiced frames, it is necessary to ensure that the values of 
the nonstationarity measure in the low frequency subbands are in a monotonically 
nondecreasing order. This condition is enforced for the 3 lower subbands according to 
the flow chart in FIG. 6. 

[00119] FIG. 6 is a flow chart depicting a method 600 for enforcing monotonic 
measures in accordance with an embodiment of the present invention. The method 
600 occurs in compute subband nonstationary measure module 116 and is initiated at 
step 602 where the adjustment for the R vector is begun. The method 600 then 
proceeds to step 604. 

[00120] At step 604 a determination is made as to whether the voicing measure is 
less than 0.6. If the determination is answered negatively, the method proceeds to 
step 622. If the determination is answered affirmatively the method proceeds to step 
606. 

[00121] At step 606 a determination is made as to whether R1 is greater than R2. If 
the determination is answered negatively, the method proceeds to step 614. If the 
determination is answered affirmatively, the method proceeds to step 608. 

[00122] At step 614 a determination is made as to whether R2 is greater than R3. If 
the determination is answered negatively, the method proceeds to step 622. If the 
determination is answered affirmatively, the method proceeds to step 616. 

[00123] At step 608 a determination is made as to whether 0.5(R1 + R2) is less than 
or equal to R3. If the determination is answered affirmatively, the method proceeds to 
step 610 where a formula is used to calculate R1 and R2. The method then proceeds 
to step 614. 

[00124] If the determination at step 608 is answered negatively, the method 
proceeds to step 612 where a series of calculations is used to calculate R1, R2 and R3. 
The method then proceeds to step 614. 

[00125] At step 616 a determination is made as to whether 0.5(R2 + R3) is greater 
than or equal to R1. If the determination is answered affirmatively, the method 
proceeds to step 618 where a series of calculations is used to calculate R2 and R3. If 
the determination is answered negatively, the method proceeds to step 620 where a 
series of calculations is used to calculate R1, R2 and R3. 

[00126] The steps 614, 618 and 620 proceed to step 622 where the adjustment of 
the R vector ends. 

[00127] The nonstationarity measure vector is vector quantized using a spectrally 
weighted quantization. The spectral weights are derived from the LPC parameters. 
First, the LPC spectral estimate corresponding to the end point of the current frame is 
estimated at the pitch harmonic frequencies. This estimate employs tilt correction and 
a slight degree of bandwidth broadening. These measures are needed to ensure that 
the quantization of formant valleys or high frequencies is not compromised by 
attaching excessive weight to formant regions or low frequencies. 

[Equation (86): the tilt-corrected, bandwidth-broadened LPC spectral estimate at the 
harmonics $0 \le k \le K_8$; only a summation starting at $m = 0$ with a factor 
$0.4^m$ is legible in the source.] 

[00128] This harmonic spectrum is converted to a subband spectrum by averaging 
across the 5 subbands used for the computation of the nonstationarity measure. 

[Equation (87) is missing from the source.] 

[00129] This is averaged with the subband spectrum at the end of the previous 
frame to derive a subband spectrum that corresponds to the center of the current 
frame. This average serves as the spectral weight vector for the quantization of the 
nonstationarity vector. 

$$W_4(l) = 0.5(W_0(l) + W_8(l)), \quad 1 \le l \le 5. \qquad (88)$$

[00130] The voicing measure is concatenated to the end of the nonstationarity 
measure vector, resulting in a 6-dimensional composite vector. This permits the 
exploitation of the considerable correlation that exists between these quantities. The 
composite vector is denoted by 

$$\Re_c = \{\Re(1) \;\; \Re(2) \;\; \Re(3) \;\; \Re(4) \;\; \Re(5) \;\; v\}. \qquad (89)$$

[00131] The spectral weight for the voicing measure is derived from the spectral 
weight for the nonstationarity measure depending on the voicing measure flag. If the 
frame is voiced ( v flag = 0) , the weight is computed as 

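[Equation (90): for $v_{flag} = 0$, $W_4(6)$ is set to a fraction of the average 
weight $\frac{1}{5} \sum_{l=1}^{5} W_4(l)$; the scale factor is not legible in the source.] 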



[00132] In other words, it is lower than the average weight for the nonstationary 
component. This ensures that the nonstationary component is quantized more 
accurately than the voicing measure. This is desirable since, for voiced frames, it is 
important to preserve the nonstationarity in the various bands to achieve the right 
degree of periodicity. On the other hand, for unvoiced frames, the voicing measure is 
more important. In this case, its weight is larger than the maximum weight for the 
nonstationary component. 

$$W_4(6) = 1.5 \max_{1 \le l \le 5} W_4(l), \quad v_{flag} = 1. \qquad (91)$$

[00133] A 64 level, 6-dimensional vector quantizer is used to quantize the 
composite nonstationarity measure-voicing measure vector. The first 8 codevectors 
(indices 0-7) are assigned to represent unvoiced frames and the remaining 56 
codevectors (indices 8-63) are assigned to represent voiced frames. The 
voiced/unvoiced decision is made based on the voicing measure flag. The following 
weighted MSE distortion measure is used: 

$$D_R(l) = \sum_{m=1}^{6} W_4(m) \left[ \Re_c(m) - V_R(l,m) \right]^2, \quad 0 \le l \le 63. \qquad (92)$$

[00134] Here, $\{V_R(l,m), 0 \le l \le 63, 1 \le m \le 6\}$ is the 64 level, 6-dimensional 
composite nonstationarity measure-voicing measure codebook and $D_R(l)$ is the 
weighted MSE distortion for the l-th codevector. If the frame is unvoiced 
($v_{flag} = 1$), this distortion is minimized over the indices 0-7. If the frame is 
voiced ($v_{flag} = 0$), the distortion is minimized over the indices 8-63. Thus, 

$$D_R(l^*) = \begin{cases} \min_{0 \le l \le 7} D_R(l), & v_{flag} = 1 \\ \min_{8 \le l \le 63} D_R(l), & v_{flag} = 0 \end{cases} \qquad (93)$$

[00135] This partitioning of the codebook reflects the higher importance given to 
the representation of the nonstationarity measure during voiced frames. The 6-bit 
index of the optimal codevector /* is transmitted to the decoder as the nonstationarity 
measure index. It should be noted that the voicing measure flag, which is used in the 
decoder 100B for the inverse quantization of the PW magnitude vector, can be 
detected by examining the value of this index. 
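
The partitioned codebook search, and the recovery of the voicing measure flag from the transmitted index, can be sketched as follows. The function names and the list-of-lists codebook layout are assumptions for illustration; the weights are taken as given inputs.

```python
def weighted_mse(weights, target, codevector):
    """Spectrally weighted squared error between target and codevector."""
    return sum(w * (t - c) ** 2 for w, t, c in zip(weights, target, codevector))


def search_partitioned(codebook, weights, target, v_flag):
    """Search indices 0-7 for unvoiced frames (v_flag == 1) and
    indices 8-63 for voiced frames (v_flag == 0)."""
    first, last = (0, 7) if v_flag == 1 else (8, 63)
    return min(range(first, last + 1),
               key=lambda l: weighted_mse(weights, target, codebook[l]))


def flag_from_index(index):
    """Decoder side: the index alone reveals the voicing measure flag."""
    return 1 if index <= 7 else 0
```

Since the search never leaves the partition selected by the flag, the decoder can infer the flag without any extra bit.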

[00136] Up to this point, the PW vectors are processed in Cartesian (i.e., real- 
imaginary) form. The FDI codec 100 at 4.0 kbit/s encodes only the PW magnitude 
information to make the most efficient use of available bits. PW phase spectra are not 
encoded explicitly. Further, in order to avoid the computation intensive square-root 
operation in computing the magnitude of a complex number, the PW magnitude- 
squared vector is used during the quantization process. 

[00137] The PW magnitude vector is quantized using a hierarchical approach, 
which allows the use of fixed dimension VQ with a moderate number of levels and 
precise quantization of perceptually important components of the magnitude 
spectrum. In this approach, the PW magnitude is viewed as the sum of two 
components: a PW mean component, which is obtained by averaging the PW 
magnitude across frequencies within a 7 band sub-band structure, and a PW deviation 
component, which is the difference between the PW magnitude and the PW mean. 
The PW mean component captures the average level of the PW magnitude across 
frequency, which is important to preserve during encoding. The PW deviation 
contains the finer structure of the PW magnitude spectrum and is not important at all 
frequencies. It is only necessary to preserve the PW deviation at a small set of 
perceptually important frequencies. The remaining elements of PW deviation can be 
discarded, leading to a small, fixed dimensionality of the PW deviation component. 
[00138] The PW magnitude vector is quantized differently for voiced and unvoiced 
frames as determined by the voicing measure flag. Since the quantization index of the 
nonstationarity measure is determined by the voicing measure flag, the PW magnitude 
quantization mode information is conveyed without any additional overhead. 
[00139] During voiced frames, the spectral characteristics of the residual are 
relatively stationary. Since the PW mean component is almost constant across the 
frame, it is adequate to transmit it once per frame. The PW deviation is transmitted 
twice per frame, at the 4th and 8th subframes. Further, interframe predictive 
quantization can be used in the voiced mode. On the other hand, unvoiced frames tend 
to be nonstationary. To track the variations in PW spectra, both mean and deviation 
components are transmitted twice per frame, at the 4th and 8th subframes. Prediction is 
not employed in the unvoiced mode. 

[00140] The PW magnitude vectors at subframes 4 and 8 are smoothed by a 3- 
point window. This smoothing can be viewed as an approximate form of decimation 
filtering to down sample the PW vector from 8 vectors/frame to 2 vectors/frame. 

$$\bar{P}_m(k) = 0.3P_{m-1}(k) + 0.4P_m(k) + 0.3P_{m+1}(k), \quad 0 \le k \le K_m, \; m = 4, 8. \qquad (94)$$

[00141] The subband mean vector is computed by averaging the PW magnitude 
vector across 7 subbands. The subband edges in Hz are 

$$B = [1 \;\; 400 \;\; 800 \;\; 1200 \;\; 1600 \;\; 2000 \;\; 2600 \;\; 3400]. \qquad (95)$$

[00142] To average the PW vector across frequencies, it is necessary to translate 
the subband edges in Hz to subband edges in terms of harmonic indices. The band 
edges in terms of harmonic indices for subframes 4 and 8 can be computed by 

[Equation (96) is missing from the source.] 

[00143] The mean vectors are computed at subframes 4 and 8 by averaging over 
the harmonic indices of each subband. It should be noted that, as mentioned earlier, 
since the PW vector is available in magnitude-squared form, the mean vector is in 
reality an RMS vector. This is reflected by the following equation. 

$$\bar{P}_m(l) = \sqrt{ \frac{1}{\kappa_m(l+1) - \kappa_m(l)} \sum_{k=\kappa_m(l)}^{\kappa_m(l+1)-1} |P_m(k)|^2 }, \quad 0 \le l \le 6, \; m = 4, 8. \qquad (97)$$
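
The 3-point smoothing and the RMS subband mean can be sketched as below. This is only an illustration: the function names are hypothetical, the three neighbouring PW vectors are assumed to have already been brought to a common length, and the harmonic band edges are passed in rather than computed, since their defining equation is not available here.

```python
import math


def smooth_pw(prev_pw, cur_pw, next_pw):
    """3-point window (0.3, 0.4, 0.3) across adjacent subframes, per harmonic."""
    return [0.3 * p + 0.4 * c + 0.3 * n
            for p, c, n in zip(prev_pw, cur_pw, next_pw)]


def subband_rms(pw_mag_sq, edges):
    """RMS value per subband from a magnitude-squared PW vector.
    edges[l] <= k < edges[l + 1] are the harmonics of subband l."""
    means = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        means.append(math.sqrt(sum(pw_mag_sq[lo:hi]) / (hi - lo)))
    return means
```

Working from the magnitude-squared vector means a single square root per subband is needed, rather than one per harmonic.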

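[00144] The mean vector quantization is spectrally weighted. The spectral 
weight vector is computed for subframe 8 from LP parameters as follows: 

$$W_8(k) = \frac{\left| \sum_{l=0}^{10} a_l(8)(0.4)^l e^{-j\omega_8 k l} \right|}{\left| \sum_{l=0}^{10} a_l(8)(0.98)^l e^{-j\omega_8 k l} \right|} \qquad (98)$$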



[00145] The spectral weight vector is attenuated outside the band of interest, so 
that out-of-band PW components do not influence the selection of the optimal 
codevector. 

[Equation (99): $W_8(k)$ is replaced by $W_8(k)$ multiplied by an attenuation factor for 
$0 \le k < \kappa_8(0)$ or $\kappa_8(7) \le k \le K_8$; the factor is not legible in the 
source.] 

[00146] The spectral weight vector for subframe 4 is approximated as an average 
of the spectral weight vectors of subframes 0 and 8. This approximation is used to 
reduce computational complexity of the encoder. 

$$W_4(k) = 0.5(W_0(k) + W_8(k)), \quad 0 \le k \le K_4. \qquad (100)$$

[00147] The spectral weight vectors at subframes 4 and 8 are averaged over 
subbands to serve as spectral weights for quantizing the subband mean vectors: 

$$\bar{W}_m(l) = \frac{1}{\kappa_m(l+1) - \kappa_m(l)} \sum_{k=\kappa_m(l)}^{\kappa_m(l+1)-1} W_m(k), \quad 0 \le l \le 6, \; m = 4, 8. \qquad (101)$$

[00148] The mean vectors at subframes 4 and 8 are vector quantized using a 7 bit 
codebook. A precomputed DC vector $\{P_{DC\_UV}(i), 0 \le i \le 6\}$ is subtracted from the 
mean vectors prior to quantization. The resulting vectors are matched against the 
codebook using a spectrally weighted MSE distortion measure. The distortion 
measure is computed as 

$$D_{PWM\_UV}(m,l) = \sum_{i=0}^{6} \bar{W}_m(i) \left[ V_{PWM\_UV}(l,i) - (\bar{P}_m(i) - P_{DC\_UV}(i)) \right]^2, \quad 0 \le l \le 127, \; m = 4, 8. \qquad (102)$$

Here, $\{V_{PWM\_UV}(l,i), 0 \le l \le 127, 0 \le i \le 6\}$ is the 7-dimensional, 128 level 
unvoiced mean codebook. Let $l^*_{PWM\_UV\_4}$ and $l^*_{PWM\_UV\_8}$ be the codebook 
indices that minimize the above distortion for subframes 4 and 8 respectively, i.e., 

$$D_{PWM\_UV}(m, l^*_{PWM\_UV\_m}) = \min_{0 \le l \le 127} D_{PWM\_UV}(m,l), \quad m = 4, 8. \qquad (103)$$

The quantized subband mean vectors are given by adding the optimal codevectors 
to the DC vector: 

$$\bar{P}_{mq}(i) = P_{DC\_UV}(i) + V_{PWM\_UV}(l^*_{PWM\_UV\_m}, i), \quad 0 \le i \le 6, \; m = 4, 8. \qquad (104)$$

[00149] The quantized subband mean vectors are used to derive the PW deviations 
vectors. This makes it possible to compensate for the quantization error in the mean 
vectors during the quantization of the deviations vectors. Deviations vectors are 
computed for subframes 4 and 8 by subtracting fullband vectors constructed using 
quantized mean vectors from original PW magnitude vectors. The fullband vectors are 
obtained by piecewise-constant approximation across each subband: 

$$S_m(k) = \begin{cases} 0, & 0 \le k < \kappa_m(0), \\ \bar{P}_{mq}(i), & \kappa_m(i) \le k < \kappa_m(i+1), \; 0 \le i \le 6, \\ 0, & \kappa_m(7) \le k \le K_m, \end{cases} \quad m = 4, 8. \qquad (105)$$

[00150] The deviation vector is quantized only for a small subset of the harmonics, 
which are perceptually important. There are a number of approaches to selecting the 
harmonics, by taking into account the signal characteristics, spectral energy 
distribution, etc. This embodiment of the present invention uses a simple approach 
where harmonics 1-10 are selected. This ensures that the low frequency part of the 
speech spectrum, which is perceptually important, is reproduced more accurately. 
Taking into account the fact that the PW vector is available in magnitude-squared 
form, harmonics 1-10 of the deviation vector are computed as follows: 

$$F_m(k) = \sqrt{|P_m(kstart_m + k)|^2} - S_m(kstart_m + k), \quad 1 \le k \le 10, \; m = 4, 8. \qquad (106)$$

[00151] Here, $kstart_m$ is computed so that harmonics below 200 Hz are not 
selected for computing the deviations vector: 

$$kstart_m = \begin{cases} 0, & K_m < 20, \\ 1, & 20 \le K_m < 40, \\ 2, & 40 \le K_m, \end{cases} \quad m = 4, 8. \qquad (107)$$
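
The selection of the starting harmonic and the computation of the 10-element deviation vector can be sketched as follows. The function names are hypothetical, the boundary handling at 20 and 40 harmonics is an assumption read from equation (107), and the vectors are assumed long enough to contain harmonics up to kstart + 10.

```python
import math


def kstart(num_harmonics):
    """Offset that skips harmonics below 200 Hz (boundaries assumed)."""
    if num_harmonics < 20:
        return 0
    if num_harmonics < 40:
        return 1
    return 2


def deviation_vector(pw_mag_sq, fullband_mean, num_harmonics):
    """Deviation of the PW magnitude from the fullband mean for harmonics
    kstart + 1 .. kstart + 10; pw_mag_sq holds squared magnitudes."""
    ks = kstart(num_harmonics)
    return [math.sqrt(pw_mag_sq[ks + k]) - fullband_mean[ks + k]
            for k in range(1, 11)]
```

The output always has 10 elements regardless of the pitch period, which is what allows a fixed dimension codebook to be used.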

[00152] The quantization of deviations vectors is carried out by a 6-bit vector 
quantizer using a spectrally weighted MSE distortion measure. 

$$D_{PWD\_UV}(m,l) = \sum_{k=1}^{10} W_m(k + kstart_m) \left[ V_{PWD\_UV}(l,k) - F_m(k) \right]^2, \quad 0 \le l \le 63, \; m = 4, 8. \qquad (108)$$

[00153] Here, $\{V_{PWD\_UV}(l,k), 0 \le l \le 63, 1 \le k \le 10\}$ is the 10-dimensional, 
64 level unvoiced deviations codebook. Let $l^*_{PWD\_UV\_4}$ and $l^*_{PWD\_UV\_8}$ be the 
codebook indices that minimize the above distortion for subframes 4 and 8 
respectively, i.e., 

$$D_{PWD\_UV}(m, l^*_{PWD\_UV\_m}) = \min_{0 \le l \le 63} D_{PWD\_UV}(m,l), \quad m = 4, 8. \qquad (109)$$

[00154] The quantized deviations vectors are the optimal code-vectors: 

$$F_{mq}(k) = V_{PWD\_UV}(l^*_{PWD\_UV\_m}, k), \quad 1 \le k \le 10, \; m = 4, 8. \qquad (110)$$

[00155] The two 7-bit mean quantization indices $l^*_{PWM\_UV\_4}$, $l^*_{PWM\_UV\_8}$ and 
the two 6-bit deviation indices $l^*_{PWD\_UV\_4}$, $l^*_{PWD\_UV\_8}$ represent the PW 
magnitude information for unvoiced frames using a total of 26 bits. In addition, a 
single bit is used to represent the binary VAD flag during unvoiced frames only. 

[00156] In the voiced mode, the PW magnitude vector smoothing, the computation 
of harmonic subband edges and the PW subband mean vector at subframe 8 take place 
as in the case of unvoiced frames. In contrast to the unvoiced case, a predictive VQ 
approach is used where the quantized PW subband mean vector at subframe 0 (i.e., 
subframe 8 of the previous frame) is used to predict the PW subband mean vector at 
subframe 8. A prediction coefficient of 0.5 is used. A predetermined DC vector is 
subtracted prior to prediction. The resulting vectors are quantized by a 7-bit codebook 
using a spectrally weighted MSE distortion measure. The subband spectral weight 
vector is computed for subframe 8 as in the case of unvoiced frames. The distortion 
computation is summarized by 

$$D_{PWM\_V}(l) = \sum_{i=0}^{6} \bar{W}_8(i) \left[ V_{PWM\_V}(l,i) - (\bar{P}_8(i) - P_{DC\_V}(i)) + 0.5(\bar{P}_{0q}(i) - P_{DC\_V}(i)) \right]^2, \quad 0 \le l \le 127. \qquad (111)$$

[00157] Here, $\{V_{PWM\_V}(l,i), 0 \le l \le 127, 0 \le i \le 6\}$ is the 7-dimensional, 
128 level voiced mean codebook and $\{P_{DC\_V}(i), 0 \le i \le 6\}$ is the voiced DC vector. 
$\{\bar{P}_{0q}(i), 0 \le i \le 6\}$ is the predictor state vector, which is the same as the 
quantized PW subband mean vector at subframe 8 (i.e., $\{\bar{P}_{8q}(i), 0 \le i \le 6\}$) of 
the previous frame. Let $l^*_{PWM\_V}$ be the codebook index that minimizes the above 
distortion, i.e., 

$$D_{PWM\_V}(l^*_{PWM\_V}) = \min_{0 \le l \le 127} D_{PWM\_V}(l). \qquad (112)$$

[00158] The quantized subband mean vector at subframe 8 is given by adding the 
optimal code-vector to the predicted vector and the DC vector: 

$$\bar{P}_{8q}(i) = \max\left(0.1, \; P_{DC\_V}(i) + 0.5(\bar{P}_{0q}(i) - P_{DC\_V}(i)) + V_{PWM\_V}(l^*_{PWM\_V}, i)\right), \quad 0 \le i \le 6. \qquad (113)$$
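
The predictive quantization of the voiced subband mean can be sketched as follows, assuming the prediction-error form implied by equation (113): the DC vector is removed, half of the previous quantized (DC-removed) mean is used as the prediction, and the reconstruction is floored at 0.1. The function name and argument layout are hypothetical.

```python
def quantize_voiced_mean(mean8, prev_q, dc, codebook, weights, coef=0.5):
    """Return (best index, quantized mean) for the subframe-8 subband mean.
    prev_q is the quantized mean of subframe 8 of the previous frame."""
    predicted = [d + coef * (p - d) for p, d in zip(prev_q, dc)]
    target = [m - pr for m, pr in zip(mean8, predicted)]  # prediction error

    def distortion(l):
        return sum(w * (c - t) ** 2
                   for w, c, t in zip(weights, codebook[l], target))

    best = min(range(len(codebook)), key=distortion)
    # The floor keeps the reconstructed mean strictly positive.
    quantized = [max(0.1, pr + c) for pr, c in zip(predicted, codebook[best])]
    return best, quantized
```

The returned quantized vector becomes the predictor state for the next frame, so encoder and decoder stay synchronized as long as the index is received correctly.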

[00159] Since the mean vector is an average of PW magnitudes, it should be a 
nonnegative value. This is enforced by the maximization operation in the above 
equation 113. 

[00160] A fullband mean vector $\{S_8(k), 0 \le k \le K_8\}$ is constructed at subframe 8 
using the quantized subband mean vector, as in the unvoiced mode. A subband mean 
vector is constructed for subframe 4 by linearly interpolating between the quantized 
subband mean vectors of subframes 0 and 8: 

$$\bar{P}_4(i) = 0.5(\bar{P}_{0q}(i) + \bar{P}_{8q}(i)), \quad 0 \le i \le 6. \qquad (114)$$

[00161] A fullband mean vector $\{S_4(k), 0 \le k \le K_4\}$ is constructed at subframe 4 
using this interpolated subband mean vector. By subtracting these fullband mean 
vectors from the corresponding magnitude vectors, deviations vectors 
$\{F_4(k), 1 \le k \le 10\}$ and $\{F_8(k), 1 \le k \le 10\}$ are computed at subframes 4 and 
8. Note that these deviations vectors are computed only for selected harmonics, i.e., 
harmonics $(kstart_m + 1)$ to $(kstart_m + 10)$, as in the unvoiced case. The deviations 
vectors are predictively quantized based on prediction from the quantized deviation 
vector from 4 subframes ago, i.e., subframe 4 is predicted using subframe 0, and 
subframe 8 using subframe 4. A prediction coefficient of 0.55 is preferably used. 

[00162] The deviations prediction error vectors are quantized using a multi-stage 
vector quantizer with 2 stages. The 1st stage uses a 64-level codebook and the 2nd 
stage uses a 16-level codebook. To reduce complexity, another embodiment of the 
present invention considers only the 8 best candidates from the 1st codebook in 
searching the 2nd codebook. The distortion measures are spectrally weighted. The 
spectral weight vectors $\{W_4(k), 0 \le k \le 10\}$ and $\{W_8(k), 0 \le k \le 10\}$ are 
computed as in the unvoiced case. The 1st codebook uses the following distortion to 
find the 8 codevectors with the smallest distortion: 

$$D_{PWD\_V1}(m,l) = \sum_{k=1}^{10} W_m(k + kstart_m) \left[ V_{PWD\_V1}(l,k) - F_m(k) + 0.55F_{(m-4)q}(k) \right]^2, \quad 0 \le l \le 63, \qquad (115)$$

where $\{l_{PWD\_V\_m}(i), 0 \le i \le 7\}$ are the 8 indices associated with the 8 best 
codewords. 

The entire 2nd codebook is searched for each of the 8 codevectors from the 1st 
codebook, so as to minimize the distortion between the input vector and the sum of 
the 1st and 2nd codebook vectors: 

$$\min_{\substack{l_1 \in \{l_{PWD\_V\_m}(i)\} \\ 0 \le l_2 \le 15}} D_{PWD\_V}(m, l_1, l_2) = \sum_{k=1}^{10} W_m(k + kstart_m) \left[ V_{PWD\_V1}(l_1,k) + V_{PWD\_V2}(l_2,k) - F_m(k) + 0.55F_{(m-4)q}(k) \right]^2 \qquad (116)$$

where $l_1 = l^*_{PWD\_V1\_4}$ and $l_2 = l^*_{PWD\_V2\_4}$ minimize the above distortion for 
subframe 4, and $l_1 = l^*_{PWD\_V1\_8}$ and $l_2 = l^*_{PWD\_V2\_8}$ minimize the above 
distortion for subframe 8. Then, the 7-bit mean quantization index $l^*_{PWM\_V}$, the 
6-bit index $l^*_{PWD\_V1\_4}$, the 4-bit index $l^*_{PWD\_V2\_4}$, the 6-bit index 
$l^*_{PWD\_V1\_8}$ and the 4-bit index $l^*_{PWD\_V2\_8}$ together represent the 27 bits of 
PW magnitude information for voiced frames. It should be noted that voiced frames 
are implicitly assumed to be active, which removes the need for transmitting the VAD 
flag. 
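
The two-stage search with a reduced candidate list can be sketched as follows. The function names are hypothetical, the weights are assumed to be already aligned with the selected harmonics, and the target is the prediction error (the deviation vector minus 0.55 times the previously quantized deviation vector).

```python
def prediction_error(deviation, prev_quantized, coef=0.55):
    """Deviation vector minus its prediction from 4 subframes earlier."""
    return [f - coef * fq for f, fq in zip(deviation, prev_quantized)]


def two_stage_search(target, weights, cb1, cb2, m_best=8):
    """Keep the m_best first-stage codevectors, then search the whole
    second-stage codebook for each of them; return (l1, l2)."""
    def dist(vec):
        return sum(w * (v - t) ** 2 for w, v, t in zip(weights, vec, target))

    candidates = sorted(range(len(cb1)), key=lambda l: dist(cb1[l]))[:m_best]
    best = None
    for l1 in candidates:
        for l2 in range(len(cb2)):
            d = dist([a + b for a, b in zip(cb1[l1], cb2[l2])])
            if best is None or d < best[0]:
                best = (d, l1, l2)
    return best[1], best[2]
```

With a 64-level first stage, a 16-level second stage and 8 survivors, the joint search evaluates 64 + 8 x 16 = 192 distortions instead of the 1024 needed for an exhaustive search.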

[00163] In the unvoiced mode, the VAD flag is explicitly encoded using a binary 
index $l^*_{VAD\_UV}$: 

$$l^*_{VAD\_UV} = VAD\_FLAG. \qquad (117)$$

[00164] In the voiced mode, it is implicitly assumed that the frame is active speech. 
Consequently, it is not necessary to explicitly encode the VAD information. 

[00165] In a preferred embodiment, at 4 kb/s, the following table 1 summarizes 
the bits allocated to the quantization of the encoder parameters under voiced and 
unvoiced modes. As indicated in the table, a single parity bit is included as part of the 
80 bit compressed speech packet. This bit is intended to detect channel errors in a set 
of 24 critical (Class 1) bits. Class 1 bits consist of the 6 most significant bits (MSB) of 
the PW gain bits, 3 MSBs of 1 st LSF, 3 MSBs of 2 nd LSF, 3 MSBs of 3 rd LSF, 2 
MSBs of 4 th LSF, 2 MSBs of 5 th LSF, MSB of 6 th LSF, 3 MSBs of the pitch index 
and MSB of the nonstationarity measure index. The single parity bit is obtained by an 
exclusive OR operation of the Class 1 bit sequence. It will be appreciated by those 
skilled in the art that other bit allocations can be used and still fall within the scope of 
the present invention.

Parameter                              Voiced Mode    Unvoiced Mode
Pitch                                       7               7
LSF Parameters                             31              31
PW Gain                                     8               8
Nonstationarity & Voicing Measure           6               6
PW Magnitude: Mean                          7              14
PW Magnitude: Deviations                   20              12
VAD Flag                                    0               1
Parity Bit                                  1               1
Total / 20 ms Frame                        80              80

TABLE 1
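The parity protection described for the Class 1 bits can be illustrated with a short sketch. It assumes the 24 Class 1 bits have already been gathered into a list of 0/1 values; the function names `class1_parity` and `parity_ok` are hypothetical and the ordering of the bits within the list is not specified by the text.

```python
from functools import reduce
from operator import xor


def class1_parity(class1_bits):
    """Return the exclusive OR of the 24 Class 1 bits (0 or 1)."""
    if len(class1_bits) != 24:
        raise ValueError("expected 24 Class 1 bits")
    return reduce(xor, class1_bits, 0)


def parity_ok(class1_bits, received_parity):
    """True when the received parity bit agrees with the Class 1 bits."""
    return class1_parity(class1_bits) == received_parity
```

A decoder following this sketch would mark the frame as bad whenever `parity_ok` returns False.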



[00166] The present invention will now be discussed with reference to decoder 
100B. The decoder receives the 80 bit packet of compressed speech produced by the 
encoder and reconstructs a 20 ms segment of speech. The received bits are unpacked 
to obtain quantization indices for the LSF parameter vector, the pitch period, the PW 
gain vector, the nonstationarity measure vector and the PW magnitude vector. A 
cyclic redundancy check (CRC) flag is set if the frame is marked as a bad frame, for
example because of a frame erasure or because the parity bit, which is part of the 80
bit compressed speech packet, is not consistent with the Class 1 bits comprising the
gain, LSF, pitch and nonstationarity measure bits. Otherwise, the CRC flag is cleared.
If the CRC flag is set, the received information is discarded and bad frame masking 
techniques are employed to approximate the missing information. 
[00167] Based on the quantization indices, LSF parameters, pitch, PW gain vector, 
nonstationarity measure vector and the PW magnitude vector are decoded. The LSF 
vector is converted to LPC parameters and linearly interpolated for each subframe. 
The pitch frequency is interpolated linearly for each sample. The decoded PW gain 
vector is linearly interpolated for odd indexed subframes. The PW magnitude vector
is reconstructed depending on the voicing measure flag, obtained from the
nonstationarity measure index. The PW magnitude vector is interpolated linearly
across the frame at each subframe. For unvoiced frames (voicing measure flag = 1),
the VAD flag corresponding to the look-ahead frame is decoded from the PW
magnitude index. For voiced frames, the VAD flag is set to 1 to represent active
speech.

[00168] Based on the voicing measure and the nonstationarity measure, a phase 
model is used to derive a PW phase vector for each subframe. The interpolated PW 
magnitude vector at each subframe is combined with a phase vector from the phase 
model to obtain a complex PW vector for each subframe. 

[00169] Out-of-band components of the PW vector are attenuated. The level of the 
PW vector is restored to the RMS value represented by the PW gain vector. The PW 
vector, which is a frequency domain representation of the pitch cycle waveform of the 
residual, is transformed to the time domain by an interpolative sample-by-sample 
pitch cycle inverse DFT operation. The resulting signal is the excitation that drives the 
LP synthesis filter, constructed using the interpolated LP parameters. Prior to 
synthesis, the LP parameters are bandwidth broadened to eliminate sharp spectral 
resonances during background noise conditions. The excitation signal is filtered by 
the all-pole LP synthesis filter to produce reconstructed speech. Adaptive postfiltering
with tilt correction is used to mask coding noise and improve the perceptual quality of
speech.

[00170] The pitch period is inverse quantized by a simple table lookup operation
using the pitch index. It is converted to the radian pitch frequency corresponding to
the right edge of the frame by

$$\hat{\omega}(160) = \frac{2\pi}{\hat{p}}, \qquad (118)$$

where $\hat{p}$ is the decoded pitch period. A sample by sample pitch frequency contour is
created by interpolating between the pitch frequency of the left edge $\hat{\omega}(0)$ and the
pitch frequency of the right edge $\hat{\omega}(160)$:

$$\hat{\omega}(n) = \frac{(160 - n)\,\hat{\omega}(0) + n\,\hat{\omega}(160)}{160}, \quad 0 \le n \le 160. \qquad (119)$$

[00171] If there are abrupt discontinuities between the left edge and the right edge
pitch frequencies, the above interpolation is modified as in the case of the encoder.
Note that the left edge pitch frequency $\hat{\omega}(0)$ is the right edge pitch frequency of the
previous frame.

[00172] The index of the highest pitch harmonic within the 4000 Hz band is
computed for each subframe by

$$\hat{K}_m = \left\lfloor \frac{\pi}{\hat{\omega}(20m)} \right\rfloor, \quad 1 \le m \le 8. \qquad (120)$$
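The pitch decoding steps above can be sketched as follows. This is a minimal illustration, assuming a 160 sample frame split into eight 20 sample subframes and no abrupt pitch discontinuity; the function names `pitch_contour` and `highest_harmonic` are hypothetical, and the harmonic count assumes the highest harmonic is the largest integer multiple of the pitch frequency that does not exceed pi radians (4000 Hz at 8 kHz sampling).

```python
import math

FRAME_LEN = 160      # samples per 20 ms frame
SUBFRAME_LEN = 20    # samples per subframe


def pitch_contour(omega_left, pitch_period):
    """Linearly interpolate the radian pitch frequency across one frame.

    omega_left is the right edge frequency of the previous frame and
    pitch_period is the decoded pitch period in samples.
    """
    omega_right = 2.0 * math.pi / pitch_period
    return [((FRAME_LEN - n) * omega_left + n * omega_right) / FRAME_LEN
            for n in range(FRAME_LEN + 1)]


def highest_harmonic(contour, m):
    """Index of the highest harmonic below 4000 Hz at subframe m (1..8)."""
    return int(math.pi / contour[SUBFRAME_LEN * m])
```

For a steady pitch period of 45 samples, every subframe would contain 22 harmonics under these assumptions.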

[00173] The LSFs are quantized by a hybrid scalar-vector quantization scheme. 
The first 6 LSFs are scalar quantized using a combination of intraframe and 
interframe prediction using 4 bits/LSF. The last 4 LSFs are vector quantized using 7 
bits. 

[00174] The inverse quantization of the first 6 LSFs can be described by the
following equations:

$$\hat{\lambda}(m) = \begin{cases} S_{L,m}(l^*_{L\_S\_m}) + 0.375\,\hat{\lambda}_{prev}(m+1), & m = 0, \\ S_{L,m}(l^*_{L\_S\_m}) + 0.375\,(\hat{\lambda}_{prev}(m+1) - \hat{\lambda}_{prev}(m-1)) + \hat{\lambda}(m-1), & 1 \le m \le 5. \end{cases} \qquad (121)$$

Here, $\{l^*_{L\_S\_m}, 0 \le m < 6\}$ are the scalar quantizer indices for the first 6 LSFs,
$\{\hat{\lambda}(m), 0 \le m < 6\}$ are the first 6 decoded LSFs of the current frame,
$\{\hat{\lambda}_{prev}(m), 0 \le m < 10\}$ are the decoded LSFs of the previous frame and
$\{S_{L,m}(l), 0 \le m < 6, 0 \le l \le 15\}$ are the 16 level scalar quantizer tables for the first 6 LSFs.
The last 4 LSFs are inverse quantized based on the predetermined mean values
$\lambda_{dc}(m)$ and the received vector quantizer index for the current frame:

$$\hat{\lambda}(m) = V_L(l^*_{L\_V}, m - 6) + \lambda_{dc}(m) + 0.5\,(\hat{\lambda}_{prev}(m) - \lambda_{dc}(m)), \quad 6 \le m \le 9. \qquad (122)$$

Here, $l^*_{L\_V}$ is the vector quantizer index for the last 4 LSFs and
$\{V_L(l, m), 0 \le l \le 127, 0 \le m \le 3\}$ is the 128 level, 4-dimensional codebook for the last 4
LSFs. The stability of the inverse quantized LSFs is checked by ensuring that the
LSFs are monotonically increasing and are separated by a minimum value of
preferably 0.008. If this property is not satisfied, stability is enforced by reordering
the LSFs in a monotonically increasing order. If a minimum separation is not
achieved, the most recent stable LSF vector from a previous frame is substituted for
the unstable LSF vector.
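The stability check on the decoded LSFs can be sketched as below. This is an illustrative reading of the rule, assuming that reordering is tried first and that the last stable vector is substituted only when the reordered vector still violates the minimum separation; the names `is_stable` and `stabilize_lsf` are hypothetical.

```python
MIN_SEPARATION = 0.008  # preferred minimum spacing between adjacent LSFs


def is_stable(lsf, min_sep=MIN_SEPARATION):
    """True when the LSFs increase monotonically with at least min_sep spacing."""
    return all(b - a >= min_sep for a, b in zip(lsf, lsf[1:]))


def stabilize_lsf(lsf, last_stable, min_sep=MIN_SEPARATION):
    """Return a stable LSF vector, falling back to the last stable one."""
    if is_stable(lsf, min_sep):
        return list(lsf)
    reordered = sorted(lsf)          # enforce monotonic order
    if is_stable(reordered, min_sep):
        return reordered
    return list(last_stable)         # minimum separation still not achieved
```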

[00175] When the received frame is inactive, the decoded LSFs are used to update
an estimate for background LSFs using the following recursive relationship:

$$\lambda_{bgn}(m) = 0.98\,\lambda_{bgn}(m) + 0.02\,\hat{\lambda}(m), \quad 0 \le m \le 9. \qquad (123)$$

[00176] In order to improve the performance of the codec 100 in the presence of
background noise, the current decoded LSFs are replaced by an interpolated version of
the inverse quantized LSFs, the background noise LSFs and a DC value of the
background noise LSFs during frames that are not only active but which follow
another active frame, i.e.,

$$\hat{\lambda}(m) = 0.25\,\hat{\lambda}(m) + 0.25\,\lambda_{bgn}(m) + 0.5\,\lambda_{bgn\_dc}(m), \quad 0 \le m \le 9. \qquad (124)$$

[00177] For transitional frames, i.e., frames which are transitioning from active to
inactive or vice versa, the interpolation weights are altered to favor the inverse
quantized LSFs, i.e.,

$$\hat{\lambda}(m) = 0.5\,\hat{\lambda}(m) + 0.25\,\lambda_{bgn}(m) + 0.25\,\lambda_{bgn\_dc}(m), \quad 0 \le m \le 9. \qquad (125)$$

[00178] The inverse quantized LSFs are interpolated each subframe by linear
interpolation between the current LSFs $\{\hat{\lambda}(m), 0 \le m < 10\}$ and the previous LSFs
$\{\hat{\lambda}_{prev}(m), 0 \le m < 10\}$. The interpolated LSFs at each subframe are converted to LP
parameters $\{\hat{a}_m(l), 0 \le m \le 10, 1 \le l \le 8\}$.

[00179] Inverse quantization of the PW nonstationarity measure and the voicing
measure is a table lookup operation. If $l^*_R$ is the index of the composite
nonstationarity measure and the voicing measure, the decoded nonstationarity
measure is

$$\Re_1(i) = V_R(l^*_R, i), \quad 1 \le i \le 5. \qquad (126)$$

Here, $\{V_R(l, m), 0 \le l \le 63, 1 \le m \le 6\}$ is the 64 level, 6-dimensional codebook used
for the vector quantization of the composite nonstationarity measure vector. The
decoded voicing measure is

$$\hat{v} = V_R(l^*_R, 6). \qquad (127)$$

[00180] A voicing measure flag is also created based on $l^*_R$ as follows:

$$v_{flag} = \begin{cases} 0, & l^*_R > 7, \\ 1, & l^*_R \le 7. \end{cases} \qquad (128)$$

This flag determines the mode of inverse quantization used for PW magnitude.
[00181] The decoded nonstationarity measure may have excessive values due to
the small number of bits used in encoding this vector. This leads to excessive
roughness during highly periodic frames, which is undesirable. To control this
problem, during sustained intervals of highly periodic frames the decoded
nonstationarity measure is subjected to upper limits, determined based on the decoded
voicing measure. If $l^*_{R\_prev}$ denotes the nonstationarity measure index received for the
preceding frame, these rules can be expressed as follows:

$$\Re_2(0) = \begin{cases} MIN\left(\Re_1(0),\ 0.05 + \dfrac{0.95}{1 + e^{-8(\hat{v} - 0.35)}}\right), & l^*_R > 31 \text{ and } l^*_{R\_prev} > 31, \\ \Re_1(0), & \text{otherwise.} \end{cases}$$

$$\Re_2(1) = \begin{cases} MIN\left(\Re_1(1),\ \dfrac{[\text{illegible}]}{1 + e^{-8(\hat{v} - 0.25)}}\right), & l^*_R > 31 \text{ and } l^*_{R\_prev} > 31, \\ \Re_1(1), & \text{otherwise.} \end{cases}$$

$$\Re_2(2) = \begin{cases} MIN(\Re_1(2),\ 0.25 + 2.83333(\hat{v} - 0.05)), & l^*_R > 31 \text{ and } l^*_{R\_prev} > 31, \\ \Re_1(2), & \text{otherwise.} \end{cases}$$

$$\Re_2(3) = \begin{cases} MIN(\Re_1(3),\ 0.45 + 2.83333(\hat{v} - 0.05)), & l^*_R > 31 \text{ and } l^*_{R\_prev} > 31, \\ \Re_1(3), & \text{otherwise.} \end{cases}$$

$$\Re_2(4) = \begin{cases} MIN(\Re_1(4),\ 0.55 + 2.83333(\hat{v} - 0.05)), & l^*_R > 31 \text{ and } l^*_{R\_prev} > 31, \\ \Re_1(4), & \text{otherwise.} \end{cases}$$

[00182] In addition, for sustained intervals of highly periodic frames, it is
desirable to prevent excessive changes in the nonstationarity measure from one frame
to the next. This is achieved by allowing a maximum amount of permissible change
for each component of the nonstationarity measure. The changes that result in a
decrease of the nonstationarity measure are not limited. Rather, the changes that
increase the nonstationarity measure are limited by this procedure. If $\Re_{prev}$ denotes the
modified nonstationarity measure of the preceding frame, this procedure can be
summarized as follows:

$$\Re(0) = \begin{cases} MIN(\Re_2(0),\ \Re_{prev}(0) + 0.06), & l^*_R > 31 \text{ and } l^*_{R\_prev} > 31, \\ \Re_1(0), & \text{otherwise.} \end{cases}$$

$$\Re(1) = \begin{cases} MIN(\Re_2(1),\ \Re_{prev}(1) + 0.10), & l^*_R > 31 \text{ and } l^*_{R\_prev} > 31, \\ \Re_1(1), & \text{otherwise.} \end{cases}$$

$$\Re(2) = \begin{cases} MIN(\Re_2(2),\ \Re_{prev}(2) + 0.16), & l^*_R > 31 \text{ and } l^*_{R\_prev} > 31, \\ \Re_1(2), & \text{otherwise.} \end{cases}$$

$$\Re(3) = \begin{cases} MIN(\Re_2(3),\ \Re_{prev}(3) + 0.24), & l^*_R > 31 \text{ and } l^*_{R\_prev} > 31, \\ \Re_1(3), & \text{otherwise.} \end{cases}$$

$$\Re(4) = \begin{cases} MIN(\Re_2(4),\ \Re_{prev}(4) + 0.27), & l^*_R > 31 \text{ and } l^*_{R\_prev} > 31, \\ \Re_1(4), & \text{otherwise.} \end{cases}$$
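The limiting of frame-to-frame increases in the nonstationarity measure can be sketched as follows. This is a minimal illustration only: the per-component step sizes are read from the equations above (the value 0.24 for the fourth component is an assumption, since that constant is hard to read in the source), and the function name `limit_increase` is hypothetical.

```python
# Maximum permitted increase per component; 0.24 is an assumed reading.
MAX_INCREASE = (0.06, 0.10, 0.16, 0.24, 0.27)


def limit_increase(r_limited, r_prev, l_r, l_r_prev):
    """Cap increases of the nonstationarity measure during periodic runs.

    r_limited: upper-limited measure for the current frame (5 values)
    r_prev:    modified measure of the preceding frame (5 values)
    l_r, l_r_prev: nonstationarity indices of current and preceding frames
    """
    if l_r > 31 and l_r_prev > 31:
        return [min(cur, prev + step)
                for cur, prev, step in zip(r_limited, r_prev, MAX_INCREASE)]
    return list(r_limited)  # outside sustained periodic intervals: unchanged
```

Decreases pass through untouched because `min` only binds when the current value exceeds the previous value plus the step.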
[00183] The gain vector is inverse quantized by a table look-up operation. It is then
linearly transformed to reverse the transformation at the encoder. If $l^*_g$ is the gain
index, the gain values for the even indexed subframes are obtained by

$$\hat{g}_{pw}(2m) = [\text{illegible}], \quad 1 \le m \le 4, \qquad (135)$$

where $\{V_g(l, m), 0 \le l \le 255, 1 \le m \le 4\}$ is the 256 level, 4-dimensional gain
codebook.

[00184] The gain values for the odd indexed subframes are obtained by linearly
interpolating between the even indexed values:

$$\hat{g}_{pw}(2m - 1) = 0.5\,(\hat{g}_{pw}(2m - 2) + \hat{g}_{pw}(2m)), \quad 1 \le m \le 4. \qquad (136)$$

The gain values are now expressed in logarithmic units. They are converted to linear
units by

$$\hat{g}'_{pw}(m) = 10^{\hat{g}_{pw}(m)}, \quad 1 \le m \le 8. \qquad (137)$$

This gain vector is used to restore the level of the PW vector during the generation of
the excitation signal.
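The interpolation of the odd indexed gains and the conversion to linear units can be sketched as below. The sketch starts from the four even indexed log-domain gains after inverse quantization, and assumes that the value for subframe 0 is the last subframe gain of the previous frame; the function name `expand_gains` is hypothetical.

```python
def expand_gains(even_log_gains, prev_log_gain):
    """Build the 8 linear gains for one frame.

    even_log_gains: log-domain gains for subframes 2, 4, 6, 8
    prev_log_gain:  log-domain gain for subframe 0 (assumed to be the
                    last subframe of the previous frame)
    """
    log_g = [prev_log_gain] + [0.0] * 8
    for m in range(1, 5):
        log_g[2 * m] = even_log_gains[m - 1]
    for m in range(1, 5):
        # odd subframes: midpoint of the neighbouring even subframes
        log_g[2 * m - 1] = 0.5 * (log_g[2 * m - 2] + log_g[2 * m])
    return [10.0 ** g for g in log_g[1:]]
```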

[00185] Based on the decoded gain vector in the log domain, long term average 
gain values for inactive frames and active unvoiced frames are computed. These gain 
averages are useful in identifying inactive frames that were marked as active by the 
VAD. This can occur due to the hangover employed in the VAD or in the case of 
certain background noise conditions such as babble noise. By identifying such frames, 
it is possible to improve the performance of the codec 100 for background noise 
conditions. 

[00186] FIG. 7 is a flowchart for a method 700 for computing gain averages in
accordance with an embodiment of the present invention. The method 700 is
performed at the decoder 100B prior to processing by modules 124 and 126 and
is initiated at 702 where computation of Gavg_bg and Gavg_uv begins. The method 700
then proceeds to step 704 where a determination is made as to whether
rvad_flag_final and rvad_flag_DL2 equal zero and the bad frame flag is false. If the
determination is negative, the method proceeds to step 712.

[00187] At step 712 a determination is made as to whether rvad_flag_final equals
one, $l^*_R$ is less than 8 and the bad frame flag is false. If the determination is
negative, the method proceeds to step 720. If the determination is affirmative, the
method proceeds to step 714.

[00188] At step 714 a determination is made as to whether n_uv is less than 50. If the
determination is answered negatively, the method proceeds to step 716 where
Gavg_uv is calculated using a first equation. If the determination is answered
affirmatively, the method proceeds to step 718 where a second equation is used to
calculate Gavg_uv.

[00189] If the determination at step 704 is affirmative, the method proceeds to step
706 where a determination is made as to whether n_bg is less than 50. If the
determination is answered negatively, the method proceeds to step 708 where
Gavg-tmp_bg is calculated using a first equation. If the determination is answered
affirmatively, the method proceeds to step 710 where Gavg-tmp_bg is calculated using
a second equation.

[00190] The steps 708, 710, 716, 718 and 712 proceed to step 720 where Gavg_bg is
calculated. The method then proceeds to step 722 where the computation ends for
Gavg_bg and Gavg_uv.

[00191] First an average gain is computed for the entire frame:

$$G_{avg} = \frac{1}{8} \sum_{m=1}^{8} \hat{g}_{pw}(m). \qquad (138)$$

Long term average gains for inactive frames which represent the background signal
and unvoiced frames are computed according to the method 700.
[00192] The decoded voicing measure flag determines the mode of inverse
quantization of the PW magnitude vector. If $v_{flag}$ is zero, voiced mode is used and if
$v_{flag}$ is one, unvoiced mode is used.

[00193] In the voiced mode, the PW mean is transmitted once per frame and the 
PW deviation is transmitted twice per frame. Further, interframe predictive 
quantization is used in this mode. In the unvoiced mode, mean and deviation 
components are transmitted twice per frame. Prediction is not employed in the 
unvoiced mode. 

[00194] In the unvoiced mode, the VAD flag is explicitly encoded using a binary
index $l^*_{VAD\_UV}$. In this mode, the VAD flag is decoded by

$$\mathrm{RVAD\_FLAG} = \begin{cases} 0, & l^*_{VAD\_UV} = 0, \\ 1, & l^*_{VAD\_UV} = 1. \end{cases} \qquad (139)$$

[00195] In the voiced mode, it is implicitly assumed that the frame is active speech.
Consequently, it is not necessary to explicitly encode the VAD information. The VAD flag
is set to 1 indicating active speech in the voiced mode:

$$\mathrm{RVAD\_FLAG} = 1. \qquad (140)$$

[00196] It should be noted that RVAD_FLAG is the VAD flag corresponding
to the look-ahead frame, where
RVAD_FLAG, RVAD_FLAG_DL1 and RVAD_FLAG_DL2 denote the VAD flags of
the look-ahead frame, current frame and the previous frame respectively. A composite
VAD value, RVAD_FLAG_FINAL, is determined for the current frame, based on
the above VAD flags, according to the following Table 2:

RVAD_FLAG_DL2   RVAD_FLAG_DL1   RVAD_FLAG   RVAD_FLAG_FINAL
      0               0             0              0
      0               0             1              1
      0               1             0              0
      0               1             1              2
      1               0             0              1
      1               0             1              3
      1               1             0              2
      1               1             1              3

TABLE 2

The RVAD _ FLAG _ FINAL is zero for frames in inactive regions, three in active 
regions, one prior to onsets and a two prior to offsets. Isolated active frames are 
treated as inactive frames and vice versa. 
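The composite VAD decision of Table 2 amounts to a lookup keyed by the three binary flags. The sketch below is illustrative only; the dictionary name `COMPOSITE_VAD` and the function name `composite_vad` are hypothetical.

```python
# Keys are (previous frame, current frame, look-ahead frame) VAD flags,
# i.e. (RVAD_FLAG_DL2, RVAD_FLAG_DL1, RVAD_FLAG); values follow Table 2.
COMPOSITE_VAD = {
    (0, 0, 0): 0,
    (0, 0, 1): 1,
    (0, 1, 0): 0,
    (0, 1, 1): 2,
    (1, 0, 0): 1,
    (1, 0, 1): 3,
    (1, 1, 0): 2,
    (1, 1, 1): 3,
}


def composite_vad(prev_flag, cur_flag, lookahead_flag):
    """Return RVAD_FLAG_FINAL for the current frame."""
    return COMPOSITE_VAD[(prev_flag, cur_flag, lookahead_flag)]
```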

[00197] In the unvoiced mode, the mean vectors for subframes 4 and 8 are inverse
quantized as follows:

$$\hat{D}_m(i) = P_{DC\_UV}(i) + V_{PWM\_UV}(l^*_{PWM\_UV\_m}, i), \quad 0 \le i \le 6,\ m = 4, 8. \qquad (141)$$

Here, $\{\hat{D}_4(i), 0 \le i \le 6\}$ and $\{\hat{D}_8(i), 0 \le i \le 6\}$ are the inverse quantized 7-band subband
PW mean vectors, $\{V_{PWM\_UV}(l, i), 0 \le l \le 127, 0 \le i \le 6\}$ is the 7-dimensional, 128 level
unvoiced mean codebook, $l^*_{PWM\_UV\_4}$ and $l^*_{PWM\_UV\_8}$ are the indices for the mean vectors
for the 4th and 8th subframes, and $\{P_{DC\_UV}(i), 0 \le i \le 6\}$ is a predetermined DC vector for
the unvoiced mean vectors.

[00198] Due to the limited accuracy of PW mean quantization in the unvoiced
mode, it is possible to have high values of PW mean at high frequencies. This, in
conjunction with an LP synthesis filter which emphasizes high frequencies, can cause
excessive high frequency content in the reconstructed speech, leading to poor voice
quality. To control this condition, the PW mean values in the uppermost two subbands
are attenuated if they are found to be high and the LP synthesis filter has a frequency
response with a high frequency emphasis.

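[00199] The magnitude squared frequency response of the LP synthesis filter is
averaged across two bands, 0-2 kHz and 2-4 kHz:

$$S_{lb} = \frac{1}{\lfloor \hat{K}_8/2 \rfloor} \sum_{k=1}^{\lfloor \hat{K}_8/2 \rfloor} \frac{1}{\left| \sum_{m=0}^{10} \hat{a}_8(m)\, e^{-j\hat{\omega}(160)km} \right|^2}, \qquad (142)$$

$$S_{hb} = \frac{1}{\hat{K}_8 - \lfloor \hat{K}_8/2 \rfloor} \sum_{k=\lfloor \hat{K}_8/2 \rfloor + 1}^{\hat{K}_8} \frac{1}{\left| \sum_{m=0}^{10} \hat{a}_8(m)\, e^{-j\hat{\omega}(160)km} \right|^2}. \qquad (143)$$

Here, $\{\hat{a}_8(m)\}$ are the decoded, interpolated LP parameters for the 8th subframe of the
current frame, $\hat{\omega}(160)$ is the decoded pitch frequency in radians for the 160th sample
of the current frame and $\lfloor\ \rfloor$ denotes truncation to integer. A comparison of the low
band sum $S_{lb}$ against the high band sum $S_{hb}$ can reveal the degree of high frequency
emphasis in the LP synthesis filter.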
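The band averaging of the LP synthesis filter response can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the response is sampled at the pitch harmonics, the lower half of the harmonics is taken as the 0-2 kHz band and the rest as the 2-4 kHz band, and each band is a plain average. The function name `band_averages` is hypothetical.

```python
import cmath
import math


def band_averages(lp_coeffs, omega):
    """Average |1/A(e^{jw})|^2 over the low and high harmonic bands.

    lp_coeffs: LP parameters a(0..10) for subframe 8, with a(0) = 1
    omega:     decoded pitch frequency in radians at the frame edge
    Returns (S_lb, S_hb). Assumes at least two harmonics below pi.
    """
    k_max = int(math.pi / omega)      # highest harmonic below 4000 Hz
    split = k_max // 2                # assumed boundary at 2000 Hz

    def power(k):
        a_resp = sum(a_m * cmath.exp(-1j * omega * k * m)
                     for m, a_m in enumerate(lp_coeffs))
        return 1.0 / abs(a_resp) ** 2

    low = [power(k) for k in range(1, split + 1)]
    high = [power(k) for k in range(split + 1, k_max + 1)]
    return sum(low) / len(low), sum(high) / len(high)
```

A filter with high frequency emphasis yields a high band average larger than the low band average, which is the condition the attenuation logic looks for.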

[00200] An average of the PW magnitude in the first 5 subbands is computed, for
subframes 4 and 8, as follows:

(144)

The attenuation of the PW mean in the 6th and 7th subbands is performed according to
the flowchart 800 in FIG. 8.

[00201] FIG. 8 is a flow chart depicting a method 800 for computing the
attenuation of PW mean high frequency in the unvoiced bands in accordance with an
embodiment of the present invention. The method 800 is performed at the decoder
100B prior to processing by modules 124 and 126 and is initiated at step 802
where the adjustment of PW mean high frequency bands is begun for subframes 4 and
8. The method proceeds to step 804 where a determination is made as to whether
rvad_flag_final equals zero. If the determination is answered affirmatively, the method
proceeds to step 806 where D_m(5) and D_m(6) are calculated. If the determination is
answered negatively, the method proceeds to step 808.

[00202] At step 808, a determination is made as to whether S_lb is less than
0.0724 S_hb. If the determination is answered negatively, the method proceeds to step
810 where a determination is made as to whether $l^*_{R\_prev}$ is less than 8 and $l^*_R$ is less
than or equal to 5. If the determination at step 810 is answered negatively, the method
proceeds to step 812 where D_m(5) and D_m(6) are calculated. If the determination at
step 810 is answered affirmatively, the method proceeds to step 814.

[00203] At step 814, Gavg_m is computed. The method then proceeds to step
816 where a determination is made as to whether n_bg is greater than or equal to 50, n_uv
is greater than or equal to 50, and Gavg is less than Gavg_th. If the determination is
answered negatively, the method proceeds to step 812. If the determination is
answered affirmatively, the method proceeds to step 818.

[00204] At step 818, the slope is calculated. The method then proceeds to step 820
where G_a, D_m(5) and D_m(6) are calculated.

[00205] If the determination at step 808 is answered affirmatively, the method
proceeds to step 822 where D_m(5) and D_m(6) are calculated. The method then
proceeds to step 824.

[00206] Steps 806, 812, 820 and 822 all proceed to step 824 where the adjustment
for the PW mean ends for subframes 4 and 8.

[00207] The deviation vectors for subframes 4 and 8 are inverse quantized as
follows:

$$\hat{F}_m(k) = V_{PWD\_UV}(l^*_{PWD\_UV\_m}, k), \quad 1 \le k \le 10,\ m = 4, 8. \qquad (145)$$

Here, $\{\hat{F}_4(k), 1 \le k \le 10\}$ and $\{\hat{F}_8(k), 1 \le k \le 10\}$ are the inverse quantized PW
deviation vectors. $\{V_{PWD\_UV}(l, k), 0 \le l \le 63, 1 \le k \le 10\}$ is the 10-dimensional, 64 level
unvoiced deviations codebook. $l^*_{PWD\_UV\_4}$ and $l^*_{PWD\_UV\_8}$ are the indices for the deviations
vectors for the 4th and 8th subframes.

[00208] The subband mean vectors are converted to fullband vectors by a
piecewise constant approximation across frequency. This requires that the subband
edges in Hz are translated to subband edges in terms of harmonic indices. Let the
band edges in Hz be defined by the array

$$B_{pw} = [1\ \ 400\ \ 800\ \ 1200\ \ 1600\ \ 2000\ \ 2600\ \ 3400]. \qquad (146)$$



The band edges in terms of harmonic indices, $\{\hat{\kappa}_m(i), 0 \le i \le 7\}$, are computed for
subframes $m = 4, 8$ from the quantities $\lfloor B_{pw}(i)\hat{K}_m/4000 \rfloor$ by a piecewise rule
[equation (147), illegible in the source].



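The full band PW mean vectors are constructed at subframes 4 and 8 by

$$S_m(k) = \begin{cases} 0, & 0 \le k < \hat{\kappa}_m(0), \\ \hat{D}_m(i), & \hat{\kappa}_m(i) \le k < \hat{\kappa}_m(i+1),\ 0 \le i \le 6, \\ 0, & \hat{\kappa}_m(7) \le k \le \hat{K}_m, \end{cases} \quad m = 4, 8. \qquad (148)$$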
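The piecewise constant expansion from 7 subband means to a full band vector can be sketched as follows. The sketch takes the 8 harmonic band edges as given inputs, since their computation is not reproduced here; the function name `expand_subband_mean` is hypothetical.

```python
def expand_subband_mean(subband_mean, edges, k_max):
    """Expand 7 subband means into a full band vector S(0..k_max).

    subband_mean: 7 decoded subband mean values
    edges:        8 band edges expressed as harmonic indices
    k_max:        index of the highest harmonic in the subframe
    Harmonics below the first edge or at and above the last edge stay zero.
    """
    full = [0.0] * (k_max + 1)
    for i in range(7):
        upper = min(edges[i + 1], k_max + 1)
        for k in range(edges[i], upper):
            full[k] = subband_mean[i]
    return full
```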

[00209] The PW magnitude vector can then be reconstructed for subframes 4 and 8
by adding the full band PW mean vector to the deviations vector. In the unvoiced
mode, the deviations vector is assumed to be zero at the unselected harmonic indices.

$$\hat{P}_m(k + kstart_m) = \begin{cases} 0, & k = 0, \\ MAX(0.15,\ S_m(k + kstart_m) + \hat{F}_m(k)), & 1 \le k \le 10, \\ MAX(0.15,\ S_m(k + kstart_m)), & 11 \le k \le \hat{K}_m, \\ 0, & \hat{K}_m < k \le 60, \end{cases} \quad m = 4, 8. \qquad (149)$$

Here, $kstart_m$ is computed in the same manner as in the encoder in equation (107).

[00210] The PW magnitude vector is reconstructed for the remaining subframes
by linearly interpolating between subframes 0 and 4 (for subframes 1, 2 and 3) and
between subframes 4 and 8 (for subframes 5, 6 and 7):

$$\hat{P}_m(k) = \begin{cases} \dfrac{(4 - m)\,\hat{P}_0(k) + m\,\hat{P}_4(k)}{4}, & 0 \le k \le \hat{K}_m,\ m = 1, 2, 3, \\ \dfrac{(8 - m)\,\hat{P}_4(k) + (m - 4)\,\hat{P}_8(k)}{4}, & 0 \le k \le \hat{K}_m,\ m = 5, 6, 7. \end{cases} \qquad (150)$$
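The interpolation of the PW magnitude vector across the remaining subframes can be sketched as below. The sketch assumes the three anchor vectors (subframe 0, taken from the previous frame, and subframes 4 and 8) have equal length; the function name `interpolate_magnitudes` is hypothetical.

```python
def interpolate_magnitudes(p0, p4, p8):
    """Return a dict mapping subframe 1..8 to its PW magnitude vector.

    p0: magnitude vector at subframe 0 (subframe 8 of the previous frame)
    p4, p8: reconstructed magnitude vectors at subframes 4 and 8
    """
    out = {4: list(p4), 8: list(p8)}
    for m in (1, 2, 3):
        out[m] = [((4 - m) * a + m * b) / 4.0 for a, b in zip(p0, p4)]
    for m in (5, 6, 7):
        out[m] = [((8 - m) * a + (m - 4) * b) / 4.0 for a, b in zip(p4, p8)]
    return out
```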

[00211] In the voiced mode, the mean vector for subframe 8 is inverse quantized
based on interframe prediction:

$$\hat{D}_8(i) = MAX(0.1,\ P_{DC\_V}(i) + 0.5\,(\hat{D}_0(i) - P_{DC\_V}(i)) + V_{PWM\_V}(l^*_{PWM\_V}, i)), \quad 0 \le i \le 6. \qquad (151)$$

Here, $\{\hat{D}_8(i), 0 \le i \le 6\}$ is the 7-band subband PW mean vector,
$\{V_{PWM\_V}(l, i), 0 \le l \le 127, 0 \le i \le 6\}$ is the 7-dimensional, 128 level voiced mean
codebook, $l^*_{PWM\_V}$ is the index for the mean vector of the 8th subframe and $\{P_{DC\_V}(i), 0 \le i \le 6\}$
is a predetermined DC vector for the voiced mean vectors. Since the mean vector is an
average of PW magnitudes, it should be nonnegative. This is enforced by the
maximization operation in the above equation.

[00212] As in the case of unvoiced frames, if the values of PW mean in the highest
two bands are excessive, and this occurs in conjunction with an LP synthesis filter with a
high frequency emphasis, attenuation is applied to the PW mean values in the highest
two bands. The magnitude squared frequency response of the LP synthesis filter is
averaged across two bands, 0-2 kHz and 2-4 kHz, as in the unvoiced mode. An
average of the PW magnitude in the first 5 subbands is computed for subframe 8, as in
the unvoiced mode. Based on these values, the PW mean in the upper two bands is
attenuated according to the flowchart shown in FIG. 9.

[00213] FIG. 9 is a flow chart of a method 900 for attenuating PW mean high
frequency voice bands. The method 900 is performed at the decoder 100B prior to
processing by modules 124 and 126 and is initiated at step 902 where the
adjustment for the PW mean high frequency voice band for subframe 8 begins. The
method then proceeds to step 904.

[00214] At step 904 a determination is made as to whether S_lb is less than 1.33 S_hb.
If the determination is answered negatively, the method proceeds to step 906 where
D_m(5) and D_m(6) are calculated using a first equation. If the determination at step
904 is answered affirmatively, the method proceeds to step 908 where D_m(5) and
D_m(6) are calculated using a second equation.

[00215] Steps 906 and 908 proceed to step 910 where the adjustment of the PW
mean for high frequency bands for subframe 8 ends.

[00216] A subband mean vector is constructed for subframe 4 by linearly
interpolating between subframes 0 and 8:

$$\hat{D}_4(i) = 0.5\,(\hat{D}_0(i) + \hat{D}_8(i)), \quad 0 \le i \le 6. \qquad (152)$$

The full band PW mean vectors are constructed at subframes 4 and 8 by

$$S_m(k) = \begin{cases} 0, & 0 \le k < \hat{\kappa}_m(0), \\ \hat{D}_m(i), & \hat{\kappa}_m(i) \le k < \hat{\kappa}_m(i+1),\ 0 \le i \le 6, \\ 0, & \hat{\kappa}_m(7) \le k \le \hat{K}_m, \end{cases} \quad m = 4, 8. \qquad (153)$$

The harmonic band edges $\{\hat{\kappa}_m(i), 0 \le i \le 7\}$ are computed as in the case of unvoiced
mode.

[00217] The voiced deviation vectors for subframes 4 and 8 are predictively
quantized by a multistage vector quantizer with 2 stages. These prediction error
vectors are inverse quantized by adding the contributions of the 2 codebooks:

$$\hat{B}_m(k) = V_{PWD\_V1}(l^*_{PWD\_V1\_m}, k) + V_{PWD\_V2}(l^*_{PWD\_V2\_m}, k), \quad 1 \le k \le 10,\ m = 4, 8. \qquad (154)$$

Here, $\{\hat{B}_4(i), 0 \le i \le 9\}$ and $\{\hat{B}_8(i), 0 \le i \le 9\}$ are the PW deviation prediction error
vectors for subframes 4 and 8 respectively. $\{V_{PWD\_V1}(l, k), 0 \le l \le 63, 1 \le k \le 10\}$ is the
10-dimensional, 64 level voiced deviations codebook for the 1st stage.
$\{V_{PWD\_V2}(l, k), 0 \le l \le 15, 1 \le k \le 10\}$ is the 10-dimensional, 16 level voiced deviations
codebook for the 2nd stage. $l^*_{PWD\_V1\_4}$ and $l^*_{PWD\_V2\_4}$ are the 1st and 2nd stage indices for
the deviations vector for the 4th subframe. $l^*_{PWD\_V1\_8}$ and $l^*_{PWD\_V2\_8}$ are the 1st and 2nd
stage indices for the deviations vector for the 8th subframe. The deviations vectors are
constructed by adding the predicted components to the prediction error vectors:

$$\hat{F}_m(k) = \hat{B}_m(k) + 0.55\,\hat{F}_0(k), \quad 1 \le k \le 10,\ m = 4, 8. \qquad (155)$$

It should be noted that $\{\hat{F}_0(k), 1 \le k \le 10\}$ is the decoded deviations vector from
subframe 8 of the previous frame. If the previous frame was unvoiced, this vector is
set to zero.

The PW magnitude vector can then be reconstructed for subframes 4 and 8 by adding
the full band PW mean vector to the deviations vector. The deviations vector is
assumed to be zero at the unselected harmonic indices.

$$\hat{P}_m(k + kstart_m) = \begin{cases} 0, & k = 0, \\ MAX(0.1,\ S_m(k + kstart_m) + \hat{F}_m(k)), & 1 \le k \le 10, \\ MAX(0.1,\ S_m(k + kstart_m)), & 11 \le k \le \hat{K}_m, \\ 0, & \hat{K}_m < k \le 60, \end{cases} \quad m = 4, 8. \qquad (156)$$

Here, $kstart_m$ is computed in the same manner as in the encoder in equation (107).
[00218] The PW magnitude vector is reconstructed for the remaining subframes by
linearly interpolating between subframes 0 and 4 (for subframes 1, 2 and 3) and
between subframes 4 and 8 (for subframes 5, 6 and 7):

$$\hat{P}_m(k) = \begin{cases} \dfrac{(4 - m)\,\hat{P}_0(k) + m\,\hat{P}_4(k)}{4}, & 0 \le k \le \hat{K}_m,\ m = 1, 2, 3, \\ \dfrac{(8 - m)\,\hat{P}_4(k) + (m - 4)\,\hat{P}_8(k)}{4}, & 0 \le k \le \hat{K}_m,\ m = 5, 6, 7. \end{cases} \qquad (157)$$

It should be noted that $\{\hat{P}_0(i), 0 \le i \le 60\}$ is the decoded PW magnitude vector from
subframe 8 of the previous frame.

[00219] In the FDI codec 100, there is no explicit coding of PW phase. The salient
characteristics related to the phase, such as the degree of stationarity of the PW (i.e.,
periodicity of the time domain residual) and the variation of the stationarity as a
function of frequency, are encoded in the form of the quantized voicing measure $\hat{v}$
and the vector nonstationarity measure $\Re$ respectively. A PW phase vector is
constructed for each subframe based on this information by a two step process. In this
process, the phase of the PW is modeled as the phase of a weighted complex vector
sum of a stationary component and a nonstationary component.

[00220] In the first step, a stationary component is constructed using the decoded
voicing measure $\hat{v}$. First a complex vector is constructed by a weighted combination
of the following: the phase vector of the stationary component of the previous, i.e.,
$(m-1)$th, subframe $\{\varphi_{m-1}(k), 0 \le k \le K_{m-1}\}$, a random phase vector
$\{\psi_m(k), 0 \le k \le K_m\}$, and a fixed phase vector that is obtained from a residual
voiced pitch pulse waveform $\{\varphi_{fix}(k), 0 \le k \le K_m\}$.

[00221] In order to combine the previous phase vector, which has $K_{m-1}$ components,
with the random phase vector, which has $K_m$ components, it may be necessary to use
a modified version of the previous phase vector. If there is no pitch discontinuity
between the previous and the current subframes, this modification is simply a
truncation (if $K_{m-1} > K_m$) or padding by random phase values (if $K_{m-1} < K_m$). If there
is a pitch discontinuity, it is necessary to align the two phase vectors such that the
harmonic frequencies corresponding to the vector elements are as close as possible.
This may require either interlacing or decimating the previous phase vector. For
example, if the pitch period of the current subframe is roughly $l$ times that of the
previous subframe, $lK_{m-1} = K_m$. In this case, each element of the previous phase
vector is interlaced with $l - 1$ random phase values. On the other hand, if the pitch
period of the previous subframe is roughly $l$ times that of the current subframe,
$K_{m-1} = lK_m$. In this case, for each element of the previous phase vector, the next $l - 1$
elements are dropped. In either case, the dimension of the modified previous phase
vector will be the same as that for the current subframe. The modified
previous phase vector will be denoted by $\{\tilde{\varphi}_{m-1}(k), 0 \le k \le K_m\}$.
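The modification of the previous phase vector can be sketched as follows. This is a minimal illustration under assumptions not fixed by the text: the caller decides whether a pitch discontinuity exists and supplies the integer factor, random phases are drawn uniformly from 0 to pi, and a final truncate-or-pad step makes the length match exactly. The function name `align_previous_phase` and its parameters are hypothetical.

```python
import math
import random


def align_previous_phase(prev, length, interlace=1, decimate=1, rng=None):
    """Resize the previous phase vector to `length` elements.

    interlace > 1: current pitch period is roughly `interlace` times the
                   previous one; insert interlace - 1 random phases after
                   each previous element.
    decimate > 1:  previous pitch period is roughly `decimate` times the
                   current one; keep every `decimate`-th element.
    Otherwise the vector is only truncated or padded with random phases.
    """
    rng = rng or random.Random()

    def rand_phase():
        return rng.uniform(0.0, math.pi)   # assumed range of random phases

    if interlace > 1:
        out = []
        for phase in prev:
            out.append(phase)
            out.extend(rand_phase() for _ in range(interlace - 1))
    elif decimate > 1:
        out = list(prev[::decimate])
    else:
        out = list(prev)

    out = out[:length]                                  # truncate if too long
    out.extend(rand_phase() for _ in range(length - len(out)))  # pad if short
    return out
```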
[00222] The random phase vector provides a method of controlling the degree of 
stationarity of the phase of the stationary component. However, to prevent excessive 
randomization of the phase, the random phase component is not allowed to change 
every subframe, but is changed after several subframes depending on the pitch
period. Also, the random phase component at a given harmonic index alternates in
sign in successive changes. At the 1st subframe in every frame, the rate of
randomization for the current frame is determined based on the pitch period. For
highly aperiodic frames, the highest rate of randomization is used regardless of the
pitch period. The subframes for which the random vector is updated can be
summarized as follows:

rate 1: m = 1, 3, 5, 7    $l^*_R > 7$ or $20 \le p < 64$
rate 2: m = 1, 4, 6       $l^*_R \le 7$ and $64 \le p < 90$
rate 3: m = 1, 5          $l^*_R \le 7$ and $90 \le p \le 120$.

[00223] In addition, abrupt changes in the update rate of the random phase, i.e.,
from rate 1 in the previous frame to rate 3 in the current frame or vice versa, are
not permitted. Such cases are modified to rate 2 in the current frame. Controlling
the rate at which the phase is randomized is quite important to prevent artifacts in the
reproduced signal, especially in the presence of background noise. If the phase is
randomized every subframe, it leads to a fluttering of the reproduced signal. This is
due to the fact that such a randomization is not representative of natural signals.
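The selection of the random phase update rate, including the rule against abrupt rate changes, can be sketched as below. The sketch takes the "highly aperiodic" decision as a boolean supplied by the caller rather than deriving it from the nonstationarity index, and the names `UPDATE_SUBFRAMES` and `randomization_rate` are hypothetical.

```python
# Subframes at which the random phase vector is refreshed, per rate.
UPDATE_SUBFRAMES = {1: (1, 3, 5, 7), 2: (1, 4, 6), 3: (1, 5)}


def randomization_rate(pitch_period, highly_aperiodic, prev_rate=None):
    """Pick the update rate for the current frame.

    pitch_period:     decoded pitch period in samples (20..120)
    highly_aperiodic: True forces the highest rate regardless of pitch
    prev_rate:        rate used in the previous frame, if any
    """
    if highly_aperiodic or pitch_period < 64:
        rate = 1
    elif pitch_period < 90:
        rate = 2
    else:
        rate = 3
    # Jumps between rate 1 and rate 3 are replaced by rate 2.
    if prev_rate is not None and abs(rate - prev_rate) == 2:
        rate = 2
    return rate
```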
[00224] The random phase value is determined by a random number generator,
which generates uniformly distributed random numbers over a sub-interval of 0 to $\pi$
radians. The sub-interval is determined based on the decoded voicing measure $\hat{v}$ and
a stationarity measure $\zeta(m)$. A weighted sum of the elements of the nonstationarity
measure vector for the current frame is computed by

$$\eta = \begin{cases} 0.55\,\Re(0) + 0.49\,\Re(1) + 0.35\,\Re(2) + 0.21\,\Re(3), & l^*_R > 7, \\ 0.32\,\Re(0) + 0.32\,\Re(1) + 0.32\,\Re(2) + 0.32\,\Re(3) + 0.32\,\Re(4), & l^*_R \le 7. \end{cases} \qquad (158)$$

[00225] This is a scalar measure of the nonstationarity of the current frame. If $\eta_{prev}$
is the corresponding value for the previous frame, an interpolated stationarity measure
is obtained for each subframe by:

$$\zeta(m) = \begin{cases} MAX\left(0.65,\ \dfrac{8}{(8 - m)\,\eta_{prev} + m\,\eta}\right), & l^*_R > 7, \\ \dfrac{8}{(8 - m)\,\eta_{prev} + m\,\eta}, & l^*_R \le 7, \end{cases} \quad 1 \le m \le 8. \qquad (159)$$

[00226] The sub-interval of $[0, \pi]$ used for phase randomization is
$[\,[\text{illegible}],\ \pi p_1\,]$, where $p_1$ is determined based on the following rule depending on
the stationarity of the subframe:

$$p_1 = \begin{cases} 0.5 - 0.25\,\zeta(m), & l^*_R > 7 \text{ and } \zeta(m) < 1.0, \\ 0.25 + 0.0625\,(1 - \zeta(m)), & l^*_R > 7 \text{ and } 1.0 \le \zeta(m) < 3.0, \\ 0.125, & l^*_R > 7 \text{ and } \zeta(m) \ge 3.0, \\ 1.0, & l^*_R \le 7 \text{ and } \zeta(m) < 1.0, \\ 1.0 + 0.125\,(1 - \zeta(m)), & l^*_R \le 7 \text{ and } 1.0 \le \zeta(m) < 3.0, \\ 0.75, & l^*_R \le 7 \text{ and } \zeta(m) \ge 3.0. \end{cases} \qquad (160)$$
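The piecewise rule for the phase spread parameter can be sketched as follows. The breakpoints at stationarity values 1.0 and 3.0 and the split at index 7 are taken from the rule above; the function name `phase_spread` is hypothetical. Each branch meets the next one at the breakpoints, which is a useful sanity check on the constants.

```python
def phase_spread(l_r, zeta):
    """Return p1, the fraction of pi bounding the random phase values.

    l_r:  nonstationarity and voicing index of the frame
    zeta: interpolated stationarity measure of the subframe
    """
    if l_r > 7:
        if zeta < 1.0:
            return 0.5 - 0.25 * zeta
        if zeta < 3.0:
            return 0.25 + 0.0625 * (1.0 - zeta)
        return 0.125
    if zeta < 1.0:
        return 1.0
    if zeta < 3.0:
        return 1.0 + 0.125 * (1.0 - zeta)
    return 0.75
```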

[00227] As the subframe becomes more stationary ($\zeta(m)$ relatively high valued),
$\beta_1$ takes on lower values, thereby creating smaller values of random phase
perturbation. As the stationarity of the subframe decreases, $\beta_1$ takes on higher values,
resulting in higher values of random phase perturbation. Uniformly distributed
random numbers in this sub-interval are used as random phases. In addition,
the sign of the random phase at any given harmonic index is alternated from one
update to the next, to remove any bias in phase randomization. The weighted phase
combination of the random phase, previous phase and fixed phase is performed in two
steps. In the first step, the random phase $\psi_m(k)$ and the previous phase $\varphi_{m-1}(k)$ are
added directly, resulting in a randomized previous phase vector:

$$\xi_m(k) = \varphi_{m-1}(k) + \psi_m(k), \quad 0 \le k \le K_m. \tag{161}$$
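[00228] In the second step, the randomized phase vector and the fixed phase
vector are each given unity magnitude and combined by a weighted vector addition.
This results in a complex vector, which in general does not have unity magnitude:

$$\begin{aligned}
\mathrm{Re}[U'_m(k)] &= \alpha_1 \cos(\xi_m(k)) + (1 - \alpha_1)\cos(\varphi_{fix}(k)), \\
\mathrm{Im}[U'_m(k)] &= \alpha_1 \sin(\xi_m(k)) + (1 - \alpha_1)\sin(\varphi_{fix}(k)),
\end{aligned} \quad 0 \le k \le K_m,$$

where $\alpha_1$ is a weighting factor determined based on the quantized voicing
measure v and the stationarity measure $\zeta(m)$, computed by:

$$\alpha_1 = \begin{cases}
0.5 - 0.2\,\zeta(m), & l_R^* > 7 \text{ and } \zeta(m) \le 1.0, \\
0.3 + 0.1\,(1 - \zeta(m)), & l_R^* > 7 \text{ and } 1.0 < \zeta(m) \le 3.0, \\
0.1, & l_R^* > 7 \text{ and } \zeta(m) > 3.0, \\
1.0 - 0.2\,\zeta(m), & l_R^* \le 7 \text{ and } \zeta(m) \le 1.0, \\
0.8 + 0.15\,(1 - \zeta(m)), & l_R^* \le 7 \text{ and } 1.0 < \zeta(m) \le 3.0, \\
0.5, & l_R^* \le 7 \text{ and } \zeta(m) > 3.0.
\end{cases}$$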
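The weighted vector addition of this second step can be illustrated with a short Python sketch. The names `alpha1`, `combine_phases`, `xi` and `phi_fix` are hypothetical, the breakpoints are assumed to be inclusive on the lower case, and the sketch only shows the arithmetic of combining two unit-magnitude phasors per harmonic.

```python
import cmath
import math


def alpha1(l_r: int, zeta: float) -> float:
    """Weight given to the randomized phase (assumed reading of the rule)."""
    if l_r > 7:
        if zeta <= 1.0:
            return 0.5 - 0.2 * zeta
        if zeta <= 3.0:
            return 0.3 + 0.1 * (1.0 - zeta)
        return 0.1
    if zeta <= 1.0:
        return 1.0 - 0.2 * zeta
    if zeta <= 3.0:
        return 0.8 + 0.15 * (1.0 - zeta)
    return 0.5


def combine_phases(xi, phi_fix, a1):
    """Weighted sum of unit phasors exp(j*xi[k]) and exp(j*phi_fix[k])."""
    return [a1 * cmath.exp(1j * x) + (1.0 - a1) * cmath.exp(1j * f)
            for x, f in zip(xi, phi_fix)]


# Opposed phases give a result well below unity magnitude.
u = combine_phases([0.0], [math.pi], alpha1(8, 1.0))
```

When the two phases agree the sum has unity magnitude; when they are opposed the magnitude shrinks toward |2α₁ − 1|, which is why a separate normalization is needed afterwards.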

[00229] As the subframe becomes more stationary ($\zeta(m)$ relatively high valued),
$\alpha_1$ takes on lower values, increasing the contribution of the fixed phase vector.
Conversely, as the stationarity of the subframe decreases, $\alpha_1$ takes on higher values,
increasing the contribution of the randomized phase. The resulting vector is
normalized to unity magnitude as follows:

$$U''_m(k) = \frac{U'_m(k)}{|U'_m(k)|}, \quad 0 \le k \le K_m. \tag{164}$$

Also, the phase of this vector is computed to serve as the previous phase during the
next subframe:

$$\varphi_m(k) = \arctan\!\left(\frac{\mathrm{Im}[U''_m(k)]}{\mathrm{Re}[U''_m(k)]}\right), \quad 0 \le k \le K_m. \tag{165}$$

[00230] The above normalized vector is passed through an evolutionary low pass
filter (i.e., low pass filtering along each harmonic track) to limit excessive variations,
so that a signal having stationary characteristics (in the evolutionary sense) is
obtained. Stationarity implies that variations faster than 25 Hz are minimal. However,
due to the phase models used and the random phase component, excessive variations
are possible. This is undesirable since it produces speech that is rough and
lacks naturalness during voiced sounds. The low pass filtering operation overcomes
this problem. Delay constraints preclude the use of linear phase FIR filters.
Consequently, second order IIR filters are employed. The filter transfer function is
given by

$$\frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}}. \tag{166}$$
[00231] The filter parameters are obtained by interpolating between two sets of
filter parameters. One set of filter parameters corresponds to a low evolutionary
bandwidth and the other to a much wider evolutionary bandwidth. The interpolation
factor is selected based on the stationarity measure ($\zeta(m)$), so that the bandwidth of
the LPF constructed by interpolation between these two extremes allows the right
degree of stationarity in the filtered signal. The filter parameters corresponding to low
evolutionary bandwidth are given by equation (167):

[Equation (167) is only partly legible in the source: $a_{0p} = 1$, $a_{1p} = -1.77\cos(\cdot)$, $b_{1p} = -0.2\cos(\cdot)$, $b_{2p} = 0.07$; the cosine arguments and the values of $a_{2p}$ and $b_{0p}$ cannot be read reliably.]
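[00232] The filter parameters corresponding to high evolutionary bandwidth are:

$$\begin{aligned}
a_{0ap} &= 1, & a_{1ap} &= -1.523326, & a_{2ap} &= 0.6494950, \\
b_{0ap} &= 0.395304917, & b_{1ap} &= -0.367045695, & b_{2ap} &= 0.146146091.
\end{aligned}$$

The interpolation parameter is computed based on the stationarity measure as follows:

$$\alpha_2 = \begin{cases}
0.2, & l_R^* > 7 \text{ and } \zeta(m) \le 1.0, \\
0.2 + 0.2\,(1 - \zeta(m)), & l_R^* > 7 \text{ and } 1.0 < \zeta(m) \le 2.0, \\
0, & l_R^* > 7 \text{ and } \zeta(m) > 2.0, \\
1.0, & l_R^* \le 7 \text{ and } \zeta(m) \le 1.0, \\
1.0 + 0.32\,(1 - \zeta(m)), & l_R^* \le 7 \text{ and } 1.0 < \zeta(m) \le 3.5, \\
0.2, & l_R^* \le 7 \text{ and } \zeta(m) > 3.5.
\end{cases}$$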
[00233] It is desirable to prevent excessive variations in $\alpha_2$ from one subframe to
the next, as this would result in large variations in the filter characteristics. A
modified interpolation parameter $\beta_2$ is computed by introducing hysteresis as follows:

$$\beta_2 = \begin{cases}
\mathrm{MIN}[1.0,\ \beta_{2prev} + \mathrm{MAX}(0.2\,\beta_{2prev},\ 0.05)], & \alpha_2 > \beta_{2prev} + \mathrm{MAX}(0.2\,\beta_{2prev},\ 0.05), \\
\mathrm{MAX}[0.0,\ \beta_{2prev} - \mathrm{MAX}(0.3\,\beta_{2prev},\ 0.05)], & \alpha_2 < \beta_{2prev} - \mathrm{MAX}(0.3\,\beta_{2prev},\ 0.05), \\
\alpha_2, & \text{otherwise.}
\end{cases} \tag{170}$$

Here, $\beta_{2prev}$ is the modified interpolation parameter $\beta_2$ computed during the
preceding subframe. The interpolated filter parameters are computed by:

$$\begin{aligned}
a_j &= \beta_2\, a_{jap} + (1 - \beta_2)\, a_{jp}, \\
b_j &= \beta_2\, b_{jap} + (1 - \beta_2)\, b_{jp},
\end{aligned} \quad j = 0, 1, 2. \tag{171}$$

The evolutionary low pass filtering operation is represented by

$$U_m(k) = b_0 U''_m(k) + b_1 U''_{m-1}(k) + b_2 U''_{m-2}(k) - a_1 U_{m-1}(k) - a_2 U_{m-2}(k), \quad 0 \le k \le K_m,\ 1 \le m \le 8. \tag{172}$$

It should be noted that, if there is a pitch discontinuity, the filter state vectors (i.e.,
$U''_{m-1}(k)$, $U''_{m-2}(k)$, $U_{m-1}(k)$ and $U_{m-2}(k)$) can require truncation, interlacing and/or
decimation to align the vector elements such that the harmonic frequencies are paired
with minimal discontinuity. This procedure is similar to that described for the
previous phase vector above.
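The hysteresis rule and the second order recursion can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper names (`beta2`, `interpolate`, `biquad_step`) are hypothetical, the high-bandwidth parameter set is assumed to be the one weighted by the interpolation parameter, and the filter step includes a leading b0 term as implied by the transfer function.

```python
def beta2(alpha2: float, beta2_prev: float) -> float:
    """Limit the subframe-to-subframe change of the interpolation parameter."""
    up = max(0.2 * beta2_prev, 0.05)    # largest allowed increase
    down = max(0.3 * beta2_prev, 0.05)  # largest allowed decrease
    if alpha2 > beta2_prev + up:
        return min(1.0, beta2_prev + up)
    if alpha2 < beta2_prev - down:
        return max(0.0, beta2_prev - down)
    return alpha2


def interpolate(beta, wide, narrow):
    """Blend two coefficient sets; beta weights the wide-bandwidth set (assumed)."""
    return [beta * w + (1.0 - beta) * n for w, n in zip(wide, narrow)]


def biquad_step(b, a, x0, x1, x2, y1, y2):
    """One output of a second order IIR filter along a single harmonic track.

    x0, x1, x2 : current and two previous inputs
    y1, y2     : two previous outputs
    """
    return b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
```

For example, with a previous value of 0.5 a requested jump to 1.0 is limited to 0.6, while a requested drop to 0.0 is limited to 0.35, so the filter characteristics move gradually.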

[00234] The phase spectrum of the resulting stationary component vector $U_m(k)$
has the desired evolutionary characteristics, consistent with the stationary component
of the residual signal at the encoder 100A.

[00235] In the second step of phase construction, a nonstationary PW component
is constructed, also using the decoded voicing measure v. The nonstationary
component is expected to have some correlation with the stationary component. The
correlation is higher for periodic signals and lower for aperiodic signals. To take this
into account, the nonstationary component is constructed by a weighted addition of
the stationary component and a complex random signal. The random signal has unity
magnitude at all the harmonics.

[00236] In other words, only the phase of the random signal is randomized. In
addition, the RMS value of the random signal is normalized such that it is equal to the
RMS value of the stationary component, $G'_s$, computed by equation (173):

[The right-hand side of equation (173) is missing from the source; only the left-hand side $G'_s$ survives.]



[00237] The weighting factor used in combining the stationary and noise
components is computed based on the voicing measure and the nonstationarity
measure quantization index by:

$$\beta_3 = \begin{cases}
0.775 - \dfrac{0.625}{1 + e^{-5(v - 0.25)}}, & l_R^* > 7, \\[2ex]
0.835 - \dfrac{0.835}{1 + e^{-9(v - 0.425)}}, & l_R^* \le 7.
\end{cases} \tag{174}$$



[00238] The weighting factor increases with the periodicity of the signal. Thus,
for periodic frames, the correlation between the stationary and nonstationary
components is higher than for aperiodic frames. In addition, this correlation is
expected to decrease with increasing frequency. This is incorporated by decreasing
the weighting factor with increasing harmonic index:

$$\delta_3(k) = \beta_3 - \frac{(0.5 + 0.5v)\,\beta_3}{K_m}\,k, \quad 0 \le k \le K_m. \tag{175}$$

[00239] Thus, the weighting factor decreases linearly from $\beta_3$ at $k = 0$ to
$\beta_3 - (0.5 + 0.5v)\beta_3$ at $k = K_m$. The slope of this decrease is higher for aperiodic
frames; i.e., for aperiodic frames the correlation with the stationary component starts
at a lower value and decreases more rapidly than for periodic frames. The
nonstationary component is then computed by:

$$R_m(k) = \delta_3(k)\,U_m(k) + [1 - \delta_3(k)]\,G'_s\,N'_m(k), \quad 0 \le k \le K_m. \tag{176}$$

Here $\{N'_m(k),\ 0 \le k \le K_m\}$ is the unity magnitude complex random signal and
$\{R_m(k),\ 0 \le k \le K_m\}$ is the nonstationary PW component.
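The construction of the nonstationary component can be sketched in Python as follows. The names (`delta3`, `nonstationary`, `g_s`) are hypothetical, and drawing the random phase uniformly from [−π, π] is an assumption made only for illustration; the text states only that the random signal has unity magnitude.

```python
import cmath
import math
import random


def delta3(beta3: float, v: float, k: int, K: int) -> float:
    """Weight on the stationary component at harmonic k (linear decrease in k)."""
    return beta3 - (0.5 + 0.5 * v) * beta3 * k / K


def nonstationary(U, beta3, v, g_s, rng):
    """Blend the stationary component U with a unit-magnitude random signal.

    U   : complex stationary component, harmonics 0..K (K >= 1)
    g_s : RMS value used to scale the random signal
    rng : random.Random instance supplying the random phases
    """
    K = len(U) - 1
    out = []
    for k, u in enumerate(U):
        d = delta3(beta3, v, k, K)
        noise = cmath.exp(1j * rng.uniform(-math.pi, math.pi))
        out.append(d * u + (1.0 - d) * g_s * noise)
    return out


R = nonstationary([1 + 0j, 1j, -1 + 0j], 1.0, 0.0, 1.0, random.Random(0))
```

With a weighting factor of 1.0 at k = 0 the first harmonic is copied from the stationary component unchanged, and the random contribution grows with the harmonic index.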

[00240] The stationary and nonstationary PW components are combined by a
weighted sum to construct the complex PW vector. The subband nonstationarity
measure determines the frequency dependent weights that are used in this weighted
sum. The weights are determined such that the ratio of the RMS value of the
nonstationary component to that of the stationary component is equal to the decoded
nonstationarity measure within each subband. From equation 90, the band edges in Hz
are defined by the array

B = [1 400 800 1600 2400 3400].

As in the case of the encoder 100A, the subband edges in Hz are translated to subband
edges in terms of harmonic indices such that the i-th subband contains harmonics with
indices $\{\eta(i-1) \le k < \eta(i),\ 1 \le i \le 5\}$:

[Equation illegible in the source. Legible fragments: it defines $\eta(i)$ for $0 \le i \le 5$ by cases, using offsets of 1 and 2, the constant 4000, and an "otherwise" branch.]



[00241] The energy in each subband is computed by averaging the squared
magnitude of each harmonic within the subband. For the stationary component, the
subband energy distribution for the m-th subframe is computed by

$$ES_m(l) = \frac{1}{2(\eta(l) - \eta(l-1))} \sum_{k=\eta(l-1)}^{\eta(l)-1} |U_m(k)|^2, \quad 1 \le l \le 5.$$

For the nonstationary component, the subband energy distribution for the
m-th subframe is computed by

$$ER_m(l) = \frac{1}{2(\eta(l) - \eta(l-1))} \sum_{k=\eta(l-1)}^{\eta(l)-1} |R_m(k)|^2, \quad 1 \le l \le 5. \tag{179}$$

The subband weighting factors are computed by equation (180):

[Equation (180) is illegible in the source; it assigns a weighting factor $G_{sb}(k)$ to every k with $\eta(l-1) \le k < \eta(l)$, $1 \le l \le 5$.]

Since the bandedges exclude out-of-band components, it is necessary to explicitly
initialize the weighting factors for the out-of-band components:

[The body of equation (181) is missing from the source.]

The complex PW vector can now be constructed as a weighted combination of the
complex stationary and complex nonstationary components:

$$V'_m(k) = U_m(k) + R_m(k)\,G_{sb}(k), \quad 0 \le k \le K_m,\ 1 \le m \le 8. \tag{182}$$

However, it should be noted that this vector will have the desired phase
characteristics, but not the decoded PW magnitude. To obtain a PW vector with the
decoded magnitude and the desired phase, it is necessary to normalize the above
vector to unity magnitude and multiply it with the decoded magnitude vector $P_m(k)$:

$$V''_m(k) = \frac{V'_m(k)}{|V'_m(k)|}\,P_m(k), \quad 0 \le k \le K_m,\ 1 \le m \le 8. \tag{183}$$

This vector is the reconstructed (normalized) PW vector for subframe m.
[00242] The inverse quantized PW vector may have high valued components
outside the band of interest. Such components can deteriorate the quality of the
reconstructed signal and should be attenuated. At the high frequency end, harmonics
above 3400 Hz are attenuated. At the low frequency end, only the DC component
(i.e., the 0 Hz component) is attenuated. The attenuation characteristic is linear from 1
at the bandedge to 0 at 4000 Hz. With $V''_m(k)$ denoting the PW vector before attenuation,
$V'''_m(k)$ the attenuated vector and $\omega_m$ the pitch frequency in radians, the attenuation
process can be specified by:

$$V'''_m(k) = \begin{cases}
0, & k = 0, \\
V''_m(k), & 1 \le k < k_{um}, \\
V''_m(k)\,\dfrac{4000\,(\pi - k\omega_m)}{600\,\pi}, & k_{um} \le k \le K_m,
\end{cases}$$

where $k_{um}$ is the index of the lowest pitch harmonic that falls above 3400 Hz. It is
obtained by

$$k_{um} = \left\lfloor \frac{3400\,\pi}{4000\,\omega_m} \right\rfloor + 1. \tag{185}$$
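The out-of-band attenuation can be sketched in Python as follows. The function name `attenuate` and its arguments are hypothetical; the sketch assumes a pitch frequency expressed in radians per sample at an 8 kHz sampling rate, so that π corresponds to 4000 Hz.

```python
import math


def attenuate(V, omega):
    """Zero the DC harmonic and taper harmonics above 3400 Hz.

    V     : PW vector, V[k] is harmonic k (k = 0 is the DC component)
    omega : pitch frequency in radians per sample (assumed 8 kHz sampling)
    """
    # Index of the lowest harmonic that falls above 3400 Hz.
    k_um = math.floor(3400.0 * math.pi / (4000.0 * omega)) + 1
    out = []
    for k, v in enumerate(V):
        if k == 0:
            out.append(0.0 * v)
        elif k < k_um:
            out.append(v)
        else:
            # Linear taper: 1 at 3400 Hz, 0 at 4000 Hz.
            out.append(v * 4000.0 * (math.pi - k * omega) / (600.0 * math.pi))
    return out


# Harmonics spaced 400 Hz apart: k = 9 lies at 3600 Hz, k = 10 at 4000 Hz.
tapered = attenuate([1.0] * 11, math.pi / 10)
```

In this example the harmonic at 3600 Hz is scaled by 2/3 and the harmonic at 4000 Hz is removed, while harmonics 1 through 8 pass unchanged.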



[00243] Certain types of background noise can result in LP parameters that
correspond to sharp spectral peaks. Examples of such noise are babble noise and
an interfering talker. Peaky spectra during background noise are undesirable since they
lead to a highly dynamic reconstructed noise that interferes with the speech signal. This
can be mitigated by a mild degree of bandwidth broadening that is adapted based on
the RVAD_FLAG_FINAL computed according to table 3.6.3-3. Bandwidth
broadening is also controlled by the nonstationarity index. If the index takes on values
above 7, indicating a voiced frame, no bandwidth broadening is applied. For values
of the nonstationarity index of 7 or lower, a bandwidth broadening factor is selected
jointly with the RVAD_FLAG_FINAL according to the following equation:

$$\phi = \Phi(2\,\mathrm{RVAD\_FLAG\_FINAL} + \mathrm{VM\_INDEX}) \tag{186}$$

where VM_INDEX is related to $l_R^*$ as follows:

$$\mathrm{VM\_INDEX} = \mathrm{MIN}(3,\ \mathrm{MAX}(0,\ (l_R^* - 5))) \tag{187}$$

and the 9-dimensional array Φ is defined as follows in Table 3:



  i       0      1      2      3      4      5      6      7      8
  Φ(i)    0.96   0.96   0.96   0.97   0.975  0.98   0.99   0.99   0.99

TABLE 3



[00244] Bandwidth broadening is performed only during intervals of voice
inactivity. Bandwidth expansion increases as the frame becomes more unvoiced.
Onset and offset frames have a lower degree of bandwidth broadening compared to
frames during voice inactivity. Bandwidth expansion is applied to interpolated LPC
parameters as follows:

$$a'_m(j) = a_m(j)\,\phi^{m}, \quad 0 \le m \le 10,\ 1 \le j \le 8. \tag{188}$$

[00245] The level of the PW vector is restored to the RMS value represented by
the decoded PW gain. Due to the quantization process, the RMS value of the decoded
PW vector is not guaranteed to be unity. To ensure that the right level is achieved, it is
necessary to first normalize the PW by its RMS value and then scale it by the PW
gain. With $V'''_m(k)$ denoting the decoded PW vector, the RMS value is computed by

$$g_{rms}(m) = \sqrt{\frac{1}{2K_m + 2} \sum_{k=0}^{K_m} |V'''_m(k)|^2}, \quad 1 \le m \le 8. \tag{189}$$

The PW vector sequence is scaled by the ratio of the decoded PW gain $g_{pw}(m)$ and the
RMS value for each subframe:

$$V_m(k) = \frac{g_{pw}(m)}{g_{rms}(m)}\,V'''_m(k), \quad 0 \le k \le K_m,\ 1 \le m \le 8. \tag{190}$$

[00246] The excitation signal is constructed from the PW using an interpolative 
frequency domain synthesis process. This process is equivalent to linearly 
interpolating the PW vectors bordering each subframe to obtain a PW vector for each 
sample instant, and performing a pitch cycle inverse DFT of the interpolated PW to 
compute a single time-domain excitation sample at that sample instant. 
[00247] The interpolated PW represents an aligned pitch cycle waveform. This 
waveform is to be evaluated at a point in the pitch cycle (i.e., pitch cycle phase), 
advanced from the phase of the previous sample by the radian pitch frequency. The 
pitch cycle phase of the excitation signal at the sample instant determines the time 
sample to be evaluated by the inverse DFT. Phases of successive excitation samples 
advance within the pitch cycle by phase increments determined by the linearized pitch 
frequency contour. 

[00248] The computation of the n-th sample of the excitation signal in the m-th
sub-frame of the current frame can be conceptually represented by equation (191):

[Equation (191) is illegible in the source; it expresses $e(20(m-1)+n)$, for $0 \le n < 20$ and $0 < m \le 8$, as a sum over the harmonics k of the linearly interpolated PW.]

where $\theta(20(m-1)+n)$ is the pitch cycle phase at the n-th sample of the excitation in
the m-th sub-frame. It is recursively computed as the sum of the pitch cycle phase at
the previous sample instant and the pitch frequency at the current sample instant:

$$\theta(20(m-1)+n) = \theta(20(m-1)+n-1) + \omega(20(m-1)+n), \quad 0 \le n < 20. \tag{192}$$

[00249] This is essentially a numerical integration of the sample-by-sample pitch
frequency track to obtain the sample-by-sample pitch cycle phase. It is also possible
to use trapezoidal integration of the pitch frequency track to get a more accurate and
smoother phase track by

$$\theta(20(m-1)+n) = \theta(20(m-1)+n-1) + 0.5\,[\omega(20(m-1)+n-1) + \omega(20(m-1)+n)], \quad 0 \le n < 20. \tag{193}$$
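The two phase-accumulation rules can be compared with a short Python sketch. The function name `phase_track` and its arguments are hypothetical; the sketch simply accumulates a per-sample pitch frequency track, either sample by sample or with trapezoidal averaging of adjacent frequencies.

```python
def phase_track(theta_prev, omega_prev, omega, trapezoidal=False):
    """Accumulate the pitch cycle phase over a block of samples.

    theta_prev : phase at the sample preceding the block
    omega_prev : pitch frequency at the sample preceding the block
    omega      : pitch frequency at each sample of the block (radians)
    """
    phases = []
    theta, w_prev = theta_prev, omega_prev
    for w in omega:
        # Rectangular rule uses the current frequency only; the
        # trapezoidal rule averages it with the previous sample's value.
        step = 0.5 * (w_prev + w) if trapezoidal else w
        theta += step
        phases.append(theta)
        w_prev = w
    return phases
```

For a constant pitch frequency both rules give the same phase track; they differ only when the frequency contour changes from sample to sample, where the trapezoidal rule yields the smoother track.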

[00250] In either case, the first term circularly shifts the pitch cycle so that the 
desired pitch cycle phase occurs at the current sample instant. The second term results 
in the exponential basis functions for the pitch cycle inverse DFT. 
[00251] The approach above is a conceptual description of the excitation synthesis 
operation. Direct implementation of this approach is possible, but is highly 
computation intensive. The process can be simplified by using radix-2 FFT to 
compute an oversampled pitch cycle and by performing interpolations in the time 
domain. These techniques have been employed to achieve a computation efficient 
implementation. 

[00252] The resulting excitation signal $\{e(n),\ 0 \le n < 160\}$ is processed by an
all-pole LP synthesis filter, constructed using the decoded and interpolated LP
parameters. The first half of each sub-frame is synthesized using the LP parameters at
the left edge of the sub-frame and the second half by the LP parameters at the right
edge of the sub-frame. This ensures that locally optimal LP parameters are used to
reconstruct the speech signal. The transfer function of the LP synthesis filter for the
first half of the m-th sub-frame is given by

$$H_{LP1}(z) = \frac{1}{\sum_{l=0}^{10} a'_l(m-1)\,z^{-l}} \tag{194}$$

and for the second half

$$H_{LP2}(z) = \frac{1}{\sum_{l=0}^{10} a'_l(m)\,z^{-l}} \tag{195}$$

The signal reconstruction is expressed by

$$\hat{s}(20(m-1)+n) = \begin{cases}
e(20(m-1)+n) - \displaystyle\sum_{l=1}^{10} a'_l(m-1)\,\hat{s}(20(m-1)+n-l), & 0 \le n < 10, \\[2ex]
e(20(m-1)+n) - \displaystyle\sum_{l=1}^{10} a'_l(m)\,\hat{s}(20(m-1)+n-l), & 10 \le n < 20,
\end{cases} \quad 0 < m \le 8. \tag{196}$$

The resulting signal $\{\hat{s}(n),\ 0 \le n < 160\}$ is the reconstructed speech signal.
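The half-subframe switching of LP parameters can be sketched in Python as follows. The function name `synthesize_subframe` and its arguments are hypothetical, and the sketch assumes coefficient lists of the form [1, a1, ..., aP] together with at least P previously synthesized samples as filter memory.

```python
def synthesize_subframe(e, a_left, a_right, history):
    """All-pole synthesis of one subframe with a mid-subframe coefficient switch.

    e       : excitation samples for the subframe
    a_left  : [1, a1..aP] at the left edge (used for the first half)
    a_right : [1, a1..aP] at the right edge (used for the second half)
    history : previous output samples, most recent last (length >= P)
    """
    s = list(history)
    order = len(a_left) - 1
    half = len(e) // 2
    for n, x in enumerate(e):
        a = a_left if n < half else a_right
        acc = x
        for l in range(1, order + 1):
            acc -= a[l] * s[-l]  # feedback from earlier output samples
        s.append(acc)
    return s[len(history):]


# First-order toy example: decaying response in the first half only.
out = synthesize_subframe([1.0, 0.0, 0.0, 0.0], [1.0, -0.5], [1.0, 0.0], [0.0])
```

The filter memory carries across the switch point, so only the coefficients change in the middle of the subframe, not the state.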
[00253] The reconstructed speech signal is processed by an adaptive postfilter to 
reduce the audibility of the effects of modeling and quantization. A pole-zero 
postfilter with an adaptive tilt correction is employed as disclosed in "Adaptive 
Postfiltering for Quality Enhancement of Coded Speech", IEEE Transactions on 
Speech and Audio Processing, Vol. 3, No. 1, pages 59-71, January 1995 by J.H. Chen 
and A. Gersho which is incorporated by reference in its entirety. 
[00254] The postfilter emphasizes the formant regions and attenuates the valleys
between formants. As during speech reconstruction, the first half of the sub-frame is
postfiltered by parameters derived from the LPC parameters at the left edge of the
sub-frame. The second half of the sub-frame is postfiltered by the parameters derived
from the LPC parameters at the right edge of the sub-frame. For the m-th sub-frame,
these two postfilter transfer functions are specified respectively by

$$H_{pf1}(z) = \frac{\sum_{l=0}^{10} a'_l(m-1)\,\beta_{pf}^{\,l}\,z^{-l}}{\sum_{l=0}^{10} a'_l(m-1)\,\alpha_{pf}^{\,l}\,z^{-l}} \tag{197}$$

and

$$H_{pf2}(z) = \frac{\sum_{l=0}^{10} a'_l(m)\,\beta_{pf}^{\,l}\,z^{-l}}{\sum_{l=0}^{10} a'_l(m)\,\alpha_{pf}^{\,l}\,z^{-l}} \tag{198}$$

The pole-zero postfiltering operation for the first half of the sub-frame is represented
by

$$s_{pf1}(20(m-1)+n) = \sum_{l=0}^{10} a'_l(m-1)\,\beta_{pf}^{\,l}\,\hat{s}(20(m-1)+n-l) - \sum_{l=1}^{10} a'_l(m-1)\,\alpha_{pf}^{\,l}\,s_{pf1}(20(m-1)+n-l),$$
$$0 \le n < 10,\ 0 < m \le 8. \tag{199}$$

The pole-zero postfiltering operation for the second half of the sub-frame is
represented by

$$s_{pf1}(20(m-1)+n) = \sum_{l=0}^{10} a'_l(m)\,\beta_{pf}^{\,l}\,\hat{s}(20(m-1)+n-l) - \sum_{l=1}^{10} a'_l(m)\,\alpha_{pf}^{\,l}\,s_{pf1}(20(m-1)+n-l),$$
$$10 \le n < 20,\ 0 < m \le 8. \tag{200}$$

where $\alpha_{pf}$ and $\beta_{pf}$ are the postfilter parameters. These satisfy the constraint
$0 \le \beta_{pf} < \alpha_{pf} \le 1$. A typical choice for these parameters is $\alpha_{pf} = 0.875$ and
$\beta_{pf} = 0.6$.

[00255] The postfilter introduces a frequency tilt with a mild low pass
characteristic to the spectrum of the filtered speech, which leads to a muffling of
postfiltered speech. This is corrected by a tilt-correction mechanism, which estimates
the spectral tilt introduced by the postfilter and compensates for it by a high frequency
emphasis. A tilt correction factor is estimated as the first normalized autocorrelation
lag of the impulse response of the postfilter. Let $v_{pf1}$ and $v_{pf2}$ be the two tilt
correction factors computed for the two postfilters in equations 197 and 198,
respectively. Then the tilt correction operations for the two half sub-frames are as
follows:

$$s_{pf}(20(m-1)+n) = \begin{cases}
s_{pf1}(20(m-1)+n) - 0.8\,v_{pf1}\,s_{pf1}(20(m-1)+n-1), & 0 \le n < 10, \\
s_{pf1}(20(m-1)+n) - 0.8\,v_{pf2}\,s_{pf1}(20(m-1)+n-1), & 10 \le n < 20,
\end{cases} \quad 0 < m \le 8. \tag{201}$$

[00256] The postfilter alters the energy of the speech signal. Hence it is desirable to
restore the RMS value of the speech signal at the postfilter output to the RMS value of
the speech signal at the postfilter input. The RMS value of the postfilter input speech
for the m-th sub-frame is computed by:

$$\sigma_{prepf}(m) = \sqrt{\frac{1}{20} \sum_{n=0}^{19} \hat{s}^2(20(m-1)+n)}, \quad 0 < m \le 8. \tag{202}$$

The RMS value of the postfilter output speech for the m-th sub-frame is computed
by:

$$\sigma_{pf}(m) = \sqrt{\frac{1}{20} \sum_{n=0}^{19} s_{pf}^2(20(m-1)+n)}, \quad 0 < m \le 8. \tag{203}$$

An adaptive gain factor is computed by low pass filtering the ratio of the RMS
value at the postfilter input to the RMS value at the postfilter output:

$$g_{pf}(20(m-1)+n) = 0.96\,g_{pf}(20(m-1)+n-1) + 0.04\,\frac{\sigma_{prepf}(m)}{\sigma_{pf}(m)}, \quad 0 \le n < 20,\ 1 \le m \le 8. \tag{204}$$

[00257] The postfiltered speech is scaled by the gain factor as follows:

$$s_{out}(20(m-1)+n) = g_{pf}(20(m-1)+n)\,s_{pf}(20(m-1)+n), \quad 0 \le n < 20,\ 0 < m \le 8.$$

The resulting scaled postfiltered speech signal $\{s_{out}(n),\ 0 \le n < 160\}$ constitutes one
frame (20 ms) of output speech of the decoder corresponding to the received 80 bit
packet.
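The gain restoration step can be sketched in Python as follows. The names `rms` and `scale_subframe` are hypothetical, and the sketch assumes the postfiltered subframe is not all zeros so that the RMS ratio is defined.

```python
import math


def rms(x):
    """Root mean square value of a block of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def scale_subframe(s_in, s_pf, g_prev):
    """Scale a postfiltered subframe with a smoothed gain.

    s_in   : postfilter input samples for the subframe
    s_pf   : postfilter output samples (assumed to have nonzero energy)
    g_prev : gain value carried over from the previous sample
    Returns the scaled samples and the final gain, to be used as the
    starting state for the next subframe.
    """
    ratio = rms(s_in) / rms(s_pf)
    out, g = [], g_prev
    for v in s_pf:
        # First order low pass filtering of the target gain.
        g = 0.96 * g + 0.04 * ratio
        out.append(g * v)
    return out, g
```

Because the gain is updated sample by sample with a coefficient of 0.96, a change in the input-to-output energy ratio is tracked gradually rather than applied as a step at the subframe boundary.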
Those skilled in the art can now appreciate from the foregoing description that the
broad teachings of the present invention can be implemented in a variety of forms. 
Therefore, while this invention has been described in connection with particular 
examples thereof, the true scope of the invention should not be so limited since other 
modifications will become apparent to the skilled practitioner upon a study of the 
drawings, specification and the following claims. 



